This R markdown file will go through the processes of analyzing the data for the VAMP-seq manuscript, using 4 files as the starting point. These files are the PTEN and TPMT variant score data tables, as well as the PTEN and TPMT positional score data tables. These files were provided as supplementary data files from the journal, and can also be obtained at our GitHub repo. Please place these files in the same directory as this R Markdown file to allow the analyses to proceed.
Furthermore, create a directory titled “Output” in the same directory as this R Markdown file to create the graphics files used to generate the figures.
knitr::opts_chunk$set(echo = TRUE)
rm(list = ls())
graphics.off()
markdown_directory <- getwd()
set.seed(12345)
pten_variants <- read.table(file = "PTEN_variant_data.tsv", stringsAsFactors = FALSE, sep = "\t", header = TRUE)
tpmt_variants <- read.table(file = "TPMT_variant_data.tsv", stringsAsFactors = FALSE, sep = "\t", header = TRUE)
pten_positions <- read.table(file = "PTEN_positional_data.tsv", stringsAsFactors = FALSE, sep = "\t", header = TRUE)
tpmt_positions <- read.table(file = "TPMT_positional_data.tsv", stringsAsFactors = FALSE, sep = "\t", header = TRUE)
Also, load the necessary packages, and set up the default theme that will be used in all plots
Here, we are also changing the factor levels of the amino acids to reflect a more informative order:
pten_positions$start <- factor(pten_positions$start, levels = c("A","V","I","L","M","F","Y","W","S","T","N","Q","C","G","P","R","H","K","D","E","X"))
pten_variants$start <- factor(pten_variants$start, levels = c("A","V","I","L","M","F","Y","W","S","T","N","Q","C","G","P","R","H","K","D","E","X"))
pten_variants$end <- factor(pten_variants$end, levels = c("A","V","I","L","M","F","Y","W","S","T","N","Q","C","G","P","R","H","K","D","E","X"))
tpmt_positions$start <- factor(tpmt_positions$start, levels = c("A","V","I","L","M","F","Y","W","S","T","N","Q","C","G","P","R","H","K","D","E","X"))
tpmt_variants$start <- factor(tpmt_variants$start, levels = c("A","V","I","L","M","F","Y","W","S","T","N","Q","C","G","P","R","H","K","D","E","X"))
tpmt_variants$end <- factor(tpmt_variants$end, levels = c("A","V","I","L","M","F","Y","W","S","T","N","Q","C","G","P","R","H","K","D","E","X"))
Here are some basic stats on the abundance classes for the PTEN and TPMT datatables we just imported: (note: variants with single amino acid substitutions were referred to as “missense” in the “class” column, though true missense variants achievable through single nucleotide variation are a subset of these with 1 noted in the “snv” column)
## [1] "We quantified the abundance of 7801 variants of two disease-related proteins, PTEN and TPMT"
## [1] "Furthermore, we identified 1138 low-abundance PTEN variants that could be pathogenic and 777 low-abundance TPMT variants that could alter drug metabolism"
## [1] "There are 156 synonymous PTEN variants with abundance scores"
## [1] "There are 140 nonsense PTEN variants with abundance scores"
## [1] "There are 4112 single amino acid PTEN variants with abundance scores"
## [1] "There are 1366 missense (possible by SNV) PTEN variants with abundance scores"
## [1] "There are 139 synonymous TPMT variants with abundance scores"
## [1] "There are 199 nonsense TPMT variants with abundance scores"
## [1] "There are 3689 single amino acid TPMT variants with abundance scores"
## [1] "There are 1133 missense (possible by SNV) TPMT variants with abundance scores"
To validate that we observe different EGFP:mCherry ratios depending on the fused variant, we used EGFP with WT or uknown unstable variants, as well as numerous randomly picked variants expected to span the range of the ratio. The geometric mean of the distribution of ratios for each variant was used for the comparison.
Comparison of PTEN VAMP-seq scores with individually assessed EGFP:mCherry variants (WT-normalized geometric means plotted)
pten_variants_individual <- subset(pten_variants, !is.na(egfp_geomean))
pten_stability_control_variants <- c("WT","Y68H","C136R","D252G","L345Q","C124S")
pten_variants_individual$variant <- factor(pten_variants_individual$variant, levels = c(pten_stability_control_variants,pten_variants_individual$variant[!(pten_variants_individual$variant %in% pten_stability_control_variants)]))
PTEN_individual_plot <- ggplot() + geom_errorbar(data = pten_variants_individual, aes(x = variant, ymin = egfp_geomean_lower_ci, ymax = egfp_geomean_upper_ci), color = "black") +
geom_point(data = pten_variants_individual, aes(x = variant, y = egfp_geomean), color = "black", size = 2) +
geom_point(data = pten_variants_individual, aes(x = variant, y = egfp_geomean), color = "red", size = 1) +
scale_y_continuous(breaks = c(0,0.25,0.5,0.75,1,1.25,1.5)) + theme(axis.text.x = element_text(angle = -90, vjust = 0.5, hjust = 0)) +
xlab(NULL) + ylab("Mean WT-normalized\nEGFP / mCherry ratio")
ggsave(file = "Output/PTEN_individual_plot.pdf", PTEN_individual_plot, height = 3, width = 5)
PTEN_individual_plot
Comparison of TPMT VAMP-seq scores with individually assessed EGFP:mCherry variants (WT-normalized geometric means plotted)
tpmt_individual <- subset(tpmt_variants, !is.na(single_WTNorm))[,c("variant","single_WTNorm","single_StErr")]
tpmt_individual <- rbind(tpmt_individual, c("WT","1","0"))
colnames(tpmt_individual) <- c("variant","mean","se")
tpmt_individual$mean <- as.numeric(tpmt_individual$mean)
tpmt_individual$se <- as.numeric(tpmt_individual$se)
tpmt_individual$lower_ci <- tpmt_individual$mean - qnorm(0.975) * tpmt_individual$se
tpmt_individual$upper_ci <- tpmt_individual$mean + qnorm(0.975) * tpmt_individual$se
tpmt_individual$position <- gsub("[^0-9]", "", tpmt_individual$variant)
tpmt_individual[tpmt_individual$position == "","position"] <- 0
tpmt_individual$position <- as.numeric(tpmt_individual$position)
tpmt_stability_control_variants <- c("WT","A80P","A154T","Y240C")
tpmt_individual$variant <- factor(tpmt_individual$variant, levels = c(tpmt_stability_control_variants,tpmt_individual$variant[!(tpmt_individual$variant %in% tpmt_stability_control_variants)]))
TPMT_individual_plot <- ggplot() + geom_errorbar(data = tpmt_individual, aes(x = variant, ymax = upper_ci, ymin = lower_ci), color = "black") +
geom_point(data = tpmt_individual, aes(x = variant, y = mean), color = "black", size = 2) +
geom_point(data = tpmt_individual, aes(x = variant, y = mean), color = "red", size = 1) +
scale_y_continuous(breaks = c(0,0.25,0.5,0.75,1,1.25,1.5)) + theme(axis.text.x = element_text(angle = -90, vjust = 0.5, hjust = 0)) +
xlab(NULL) + ylab("Mean WT-normalized\nEGFP / mCherry ratio")
ggsave(file = "Output/TPMT_individual_plot.pdf", TPMT_individual_plot, height = 3, width = 4.5)
TPMT_individual_plot
Paste some stats amongst the PTEN and TPMT WTs and controls
## [1] "PTEN WT to low abundance controls fold differene: 6.96"
## [1] "TPMT WT to low abundance controls fold differene: 2.4326"
## [1] "TPMT WT to low abundance control A80P fold differene: 4.4397"
## [1] "TPMT WT to low abundance control Y240C fold differene: 1.5457"
In the case of PTEN, we also tested fusion of a shorter 15-amino acid fragment of EGFP, rather than the entire protein itself, to see if size of the fused sequence with the PTEN protein altered these ratios at all.
## [1] "PTEN full-EGFP:mCherry and splitGFP:mCherry fusion ratio correlations, Pearson's r: 0.98"
## [1] "PTEN full-EGFP:mCherry and splitGFP:mCherry fusion ratio correlations, Spearman's R: 0.94"
PTEN_egfp_sgfp_plot <- ggplot() + geom_hline(yintercept = 1, color = "grey50") + geom_vline(xintercept = 1, color = "grey50") +
geom_abline(slope = lm(pten_variants$sgfp_geomean ~ pten_variants$egfp_geomean)$coefficient[2], intercept = lm(pten_variants$sgfp_geomean ~ pten_variants$egfp_geomean)$coefficient[1], linetype = 2, color = "grey50") +
geom_point(data = pten_variants, aes(x = egfp_geomean, y = sgfp_geomean)) +
geom_text(data = pten_variants, aes(x = egfp_geomean, y = sgfp_geomean + 0.03), label = pten_variants$variant, color = "red") +
annotate("text", x = min(subset(pten_variants, !is.na(sgfp_geomean))$egfp_geomean, na.rm = TRUE), y = max(subset(pten_variants, !is.na(sgfp_geomean))$sgfp_geomean, na.rm = TRUE) + 0.03,
label = paste("r:",round(sqrt(summary(lm(pten_variants$sgfp_geomean ~ pten_variants$egfp_geomean))$r.squared),2)), hjust = 0.1) +
annotate("text", x = min(subset(pten_variants, !is.na(sgfp_geomean))$egfp_geomean, na.rm = TRUE), y = max(subset(pten_variants, !is.na(sgfp_geomean))$sgfp_geomean - 0.1, na.rm = TRUE) + 0.03,
label = paste("rho:",round(cor(pten_variants$sgfp_geomean, pten_variants$egfp_geomean, use="pairwise.complete.obs", method = "spearman"),2)), hjust = 0.1, parse = TRUE) +
scale_x_continuous(limits = c(0,1.1), breaks = c(0,0.25,0.5,0.75,1), expand = c(0,0)) + scale_y_continuous(limits = c(0,1.1), breaks = c(0,0.25,0.5,0.75,1), expand = c(0,0)) +
xlab("WT-normalized EGFP/mCherry ratio") + ylab("WT-normalized split-EGFP/mCherry ratio")
ggsave(file = "Output/PTEN_egfp_sgfp_plot.pdf", PTEN_egfp_sgfp_plot, height = 4, width = 4)
PTEN_egfp_sgfp_plot
We also wanted to ensure that the replicate experiments correlated with each other. Thus, we made all-by-all comparisons of the replicate experiments for PTEN and TPMT.
scatterplot_correlation_fxn <- function(data, mapping, ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point(alpha = 0.01) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line = element_line(colour = "black"))
p
}
pten_variants_min <- pten_variants[,c("class","score1","score2","score3","score4","score5","score6","score7","score8")]
tpmt_variants_min <- tpmt_variants[,c("class","score1","score2","score3","score4","score5","score6","score7","score8")]
## PTEN experiment correlations
pten_pearson_cor_matrix <- as.matrix(cor(pten_variants_min[,c("score1","score2","score3","score4","score5","score6","score7","score8")], use="pairwise.complete.obs", method = "pearson"))
pten_spearman_cor_matrix <- as.matrix(cor(pten_variants_min[,c("score1","score2","score3","score4","score5","score6","score7","score8")], use="pairwise.complete.obs", method = "spearman"))
pten_count_matrix <- as.matrix(cor(pten_variants_min[,c("score1","score2","score3","score4","score5","score6","score7","score8")], use="pairwise.complete.obs", method = "pearson"))
pten_scatterplot_matrix_pearson <- ggpairs(pten_variants_min,columns = c("score1","score2","score3","score4","score5","score6","score7","score8"), lower = list(continuous = scatterplot_correlation_fxn), method = "pearson")
ggsave(file = "Output/PTEN_scatterplot_matrix_pearson.png", plot = pten_scatterplot_matrix_pearson, device = "png", height =7, width = 7)
pten_pearson_cor_matrix
## score1 score2 score3 score4 score5 score6
## score1 1.0000000 0.6373224 0.6396699 0.5656064 0.4588173 0.4878436
## score2 0.6373224 1.0000000 0.8527773 0.7740414 0.4763322 0.5780968
## score3 0.6396699 0.8527773 1.0000000 0.8622124 0.4945968 0.5852729
## score4 0.5656064 0.7740414 0.8622124 1.0000000 0.4545757 0.5080155
## score5 0.4588173 0.4763322 0.4945968 0.4545757 1.0000000 0.5532682
## score6 0.4878436 0.5780968 0.5852729 0.5080155 0.5532682 1.0000000
## score7 0.6249790 0.7244336 0.7304764 0.6365151 0.8120215 0.6879506
## score8 0.6535362 0.7366443 0.7519920 0.6484557 0.5005498 0.5828860
## score7 score8
## score1 0.6249790 0.6535362
## score2 0.7244336 0.7366443
## score3 0.7304764 0.7519920
## score4 0.6365151 0.6484557
## score5 0.8120215 0.5005498
## score6 0.6879506 0.5828860
## score7 1.0000000 0.7251157
## score8 0.7251157 1.0000000
pten_spearman_cor_matrix
## score1 score2 score3 score4 score5 score6
## score1 1.0000000 0.6025425 0.6025281 0.5472287 0.4569725 0.4765467
## score2 0.6025425 1.0000000 0.8389861 0.7660957 0.4781095 0.5592512
## score3 0.6025281 0.8389861 1.0000000 0.8530996 0.4919268 0.5700722
## score4 0.5472287 0.7660957 0.8530996 1.0000000 0.4658215 0.5079552
## score5 0.4569725 0.4781095 0.4919268 0.4658215 1.0000000 0.5604804
## score6 0.4765467 0.5592512 0.5700722 0.5079552 0.5604804 1.0000000
## score7 0.5910372 0.6948333 0.7016350 0.6215694 0.8138941 0.6827469
## score8 0.6223003 0.7073999 0.7196791 0.6295328 0.4949137 0.5622754
## score7 score8
## score1 0.5910372 0.6223003
## score2 0.6948333 0.7073999
## score3 0.7016350 0.7196791
## score4 0.6215694 0.6295328
## score5 0.8138941 0.4949137
## score6 0.6827469 0.5622754
## score7 1.0000000 0.6916876
## score8 0.6916876 1.0000000
pten_scatterplot_matrix_pearson
pairwise_names <- c("score1","score2","score3","score4","score5","score6","score7","score8")
pten_pairwise_count_vector <- c()
for(x in 1:(length(pairwise_names)-1)){
for(y in (x+1):length(pairwise_names)){
print(paste(x,y))
pairwise_count <- nrow(subset(pten_variants_min, !is.na(pten_variants_min[,pairwise_names[x]]) & !is.na(pten_variants_min[,pairwise_names[y]])))
pten_pairwise_count_vector <- c(pten_pairwise_count_vector,pairwise_count)
}
}
## [1] "1 2"
## [1] "1 3"
## [1] "1 4"
## [1] "1 5"
## [1] "1 6"
## [1] "1 7"
## [1] "1 8"
## [1] "2 3"
## [1] "2 4"
## [1] "2 5"
## [1] "2 6"
## [1] "2 7"
## [1] "2 8"
## [1] "3 4"
## [1] "3 5"
## [1] "3 6"
## [1] "3 7"
## [1] "3 8"
## [1] "4 5"
## [1] "4 6"
## [1] "4 7"
## [1] "4 8"
## [1] "5 6"
## [1] "5 7"
## [1] "5 8"
## [1] "6 7"
## [1] "6 8"
## [1] "7 8"
print(paste("The PTEN replicate pairwise correlations were based on n values of:",toString(pten_pairwise_count_vector)))
## [1] "The PTEN replicate pairwise correlations were based on n values of: 2336, 2324, 2205, 2192, 1829, 1751, 2345, 3528, 3273, 2955, 2337, 2277, 3112, 3338, 2910, 2308, 2253, 3061, 2736, 2189, 2136, 2856, 2395, 2510, 2940, 2004, 2332, 2249"
## TPMT experiment correlations
tpmt_pearson_cor_matrix <- as.matrix(cor(tpmt_variants_min[,c("score1","score2","score3","score4","score5","score6","score7","score8")], use="pairwise.complete.obs", method = "pearson"))
tpmt_spearman_cor_matrix <- as.matrix(cor(tpmt_variants_min[,c("score1","score2","score3","score4","score5","score6","score7","score8")], use="pairwise.complete.obs", method = "spearman"))
# Remove data from TPMT library preparations with almost nonexistant overlapping variants
tpmt_pearson_cor_matrix[1:4,5:6] <- NA; tpmt_pearson_cor_matrix[5:6,1:4] <- NA
tpmt_spearman_cor_matrix[1:4,5:6] <- NA; tpmt_spearman_cor_matrix[5:6,1:4] <- NA
tpmt_scatterplot_matrix_pearson <- ggpairs(tpmt_variants_min,columns = c("score1","score2","score3","score4","score5","score6","score7","score8"), lower = list(continuous = scatterplot_correlation_fxn), method = "pearson")
ggsave(file = "Output/TPMT_scatterplot_matrix_pearson.png", tpmt_scatterplot_matrix_pearson, height = 7, width = 7)
tpmt_pearson_cor_matrix
## score1 score2 score3 score4 score5 score6
## score1 1.0000000 0.8447347 0.6174404 0.6072204 NA NA
## score2 0.8447347 1.0000000 0.6221349 0.6069148 NA NA
## score3 0.6174404 0.6221349 1.0000000 0.9108928 NA NA
## score4 0.6072204 0.6069148 0.9108928 1.0000000 NA NA
## score5 NA NA NA NA 1.0000000 0.8332401
## score6 NA NA NA NA 0.8332401 1.0000000
## score7 0.5675970 0.5630967 0.8556608 0.8270125 0.7030676 0.7023773
## score8 0.5794325 0.5776146 0.8671174 0.8455605 0.7495263 0.7456916
## score7 score8
## score1 0.5675970 0.5794325
## score2 0.5630967 0.5776146
## score3 0.8556608 0.8671174
## score4 0.8270125 0.8455605
## score5 0.7030676 0.7495263
## score6 0.7023773 0.7456916
## score7 1.0000000 0.8749906
## score8 0.8749906 1.0000000
tpmt_spearman_cor_matrix
## score1 score2 score3 score4 score5 score6
## score1 1.0000000 0.7152874 0.5248938 0.5265402 NA NA
## score2 0.7152874 1.0000000 0.5092189 0.5026442 NA NA
## score3 0.5248938 0.5092189 1.0000000 0.8943974 NA NA
## score4 0.5265402 0.5026442 0.8943974 1.0000000 NA NA
## score5 NA NA NA NA 1.0000000 0.8184053
## score6 NA NA NA NA 0.8184053 1.0000000
## score7 0.5016199 0.4739139 0.8349242 0.8098054 0.6817354 0.6843409
## score8 0.5016591 0.4776924 0.8460794 0.8303096 0.7416413 0.7413822
## score7 score8
## score1 0.5016199 0.5016591
## score2 0.4739139 0.4776924
## score3 0.8349242 0.8460794
## score4 0.8098054 0.8303096
## score5 0.6817354 0.7416413
## score6 0.6843409 0.7413822
## score7 1.0000000 0.8488551
## score8 0.8488551 1.0000000
tpmt_scatterplot_matrix_pearson
tpmt_pairwise_count_vector <- c()
for(x in 1:(length(pairwise_names)-1)){
for(y in (x+1):length(pairwise_names)){
pairwise_count <- nrow(subset(tpmt_variants_min, !is.na(tpmt_variants_min[,pairwise_names[x]]) & !is.na(tpmt_variants_min[,pairwise_names[y]])))
tpmt_pairwise_count_vector <- c(tpmt_pairwise_count_vector,pairwise_count)
print(paste(x,y,pairwise_count))
}
}
## [1] "1 2 3135"
## [1] "1 3 3036"
## [1] "1 4 2996"
## [1] "1 5 47"
## [1] "1 6 47"
## [1] "1 7 3064"
## [1] "1 8 3067"
## [1] "2 3 3028"
## [1] "2 4 2986"
## [1] "2 5 46"
## [1] "2 6 46"
## [1] "2 7 3048"
## [1] "2 8 3054"
## [1] "3 4 3003"
## [1] "3 5 43"
## [1] "3 6 43"
## [1] "3 7 3022"
## [1] "3 8 3023"
## [1] "4 5 41"
## [1] "4 6 41"
## [1] "4 7 2990"
## [1] "4 8 2990"
## [1] "5 6 877"
## [1] "5 7 825"
## [1] "5 8 830"
## [1] "6 7 825"
## [1] "6 8 830"
## [1] "7 8 3819"
print(paste("The TPMT replicate pairwise correlations were based on n values of:",toString(tpmt_pairwise_count_vector)))
## [1] "The TPMT replicate pairwise correlations were based on n values of: 3135, 3036, 2996, 47, 47, 3064, 3067, 3028, 2986, 46, 46, 3048, 3054, 3003, 43, 43, 3022, 3023, 41, 41, 2990, 2990, 877, 825, 830, 825, 830, 3819"
## [1] "Mean PTEN replicate correlation Pearson's r: 0.63"
## [1] "Median PTEN replicate correlation Pearson's r: 0.64"
## [1] "Mean PTEN replicate correlation Spearman's r: 0.62"
## [1] "Median PTEN replicate correlation Spearman's r: 0.6"
## [1] "Mean TPMT replicate correlation Pearson's r: 0.73"
## [1] "Median TPMT replicate correlation Pearson's r: 0.72"
## [1] "Mean TPMT replicate correlation Spearman's r: 0.67"
## [1] "Median TPMT replicate correlation Spearman's r: 0.7"
With the initial quality of the data assessed, we tried to summarize how the results looked at the protein level. A critical thresholds we repeatedly used for PTEN classification was the 5% lowest synonymous variant score. This is more stringent than using the lower-bound of the 95% confidence interval of the synonymous distribution, assuming a normal distribution.
The lowest 10% of missense variants and the median of the synonymous variants were critical values used throughout the analysis as well.
pten_synonymous_5th <- quantile(subset(pten_variants, class == "synonymous")$score, 0.05)
paste("The threshold value above which the top 95% of PTEN synonymous variant scores reside, which is used in the analysis:", round(pten_synonymous_5th,2))
## [1] "The threshold value above which the top 95% of PTEN synonymous variant scores reside, which is used in the analysis: 0.71"
paste("The lower bound of the 95% confidence interval of the PTEN synonymous distribution score assuming a normal distribution:", round(mean(subset(pten_variants, class == "synonymous")$score) - qnorm(0.975) * sd(subset(pten_variants, class == "synonymous")$score)/sqrt(8-1),2))
## [1] "The lower bound of the 95% confidence interval of the PTEN synonymous distribution score assuming a normal distribution: 0.83"
tpmt_synonymous_5th <- quantile(subset(tpmt_variants, class == "synonymous")$score, 0.05)
paste("The threshold value above which the top 95% of TPMT synonymous variant scores reside, which is used in the analysis:", round(tpmt_synonymous_5th,2))
## [1] "The threshold value above which the top 95% of TPMT synonymous variant scores reside, which is used in the analysis: 0.72"
The abundance scores of nonsense variants were plotting along the positions of the PTEN and TPMT proteins.
pten_trunc_score_position_plot <- ggplot() + geom_point(data = subset(pten_variants, class == "nonsense"), aes(x = as.numeric(position), y = score), size = 0.5) +
ylab("Stability score") + xlab("Position in protein") + geom_hline(yintercept = pten_variants[pten_variants$class == "wt","score"], color = "blue") +
scale_y_continuous(limits = c(-0.5,1.3), breaks = c(-0.4,0,0.4,0.8,1.2)) + scale_x_continuous(breaks = c(0,100,200,300,400))
ggsave(file = "Output/PTEN_trunc_score_position_plot.pdf", pten_trunc_score_position_plot, height = 1.5, width = 2.25)
tpmt_trunc_score_position_plot <- ggplot() + geom_point(data = subset(tpmt_variants, class == "nonsense"), aes(x = as.numeric(position), y = score), size = 0.5) +
ylab("Stability score") + xlab("Position in protein") + geom_hline(yintercept = pten_variants[pten_variants$class == "wt","score"], color = "blue") +
scale_y_continuous(limits = c(-0.5,1.3), breaks = c(-0.4,0,0.4,0.8,1.2)) + scale_x_continuous(breaks = c(0,50,100,150,200))
ggsave(file = "Output/TPMT_trunc_score_position_plot.pdf", tpmt_trunc_score_position_plot, height = 1.5, width = 2.25)
print(pten_trunc_score_position_plot)
print(tpmt_trunc_score_position_plot)
To summarize the data, heatmaps for the abundance scores of both proteins were produced.
First, here is the heatmap for PTEN. The WT side-chains at each position are marked by the black square, and colored according to the WT score. Positions that did not have any data are colored grey. Please see the gradients used to color the single amino acid variants in the density plot created in the code chunk below this one for the color key.
These are the separate score distributions for synonymous, nonsense, and single amino acid variants. Synonymous and nonsense variants are colored dark red and dark blue, respectively. The single amino acid variants were colored according to their score, with the single amino acids with the 10% lowest scores colored dark blue, going in a gradient toward the WT value of 1 which was colored white, with variants with scores higher than 1 going in a color gradient going toward red.
pten_m <- ggplot(data = subset(pten_variants, class == "missense"), aes(x = score, y = ..scaled..))
pten_m <- pten_m + geom_density()
pten_q <- ggplot_build(pten_m)$data[[1]]
pten_q_low <- subset(pten_q, x <= pten_lower_bound)
pten_q_mid <- subset(pten_q, x > pten_lower_bound & x < 1)
pten_q_high <- subset(pten_q, x >= 1)
pten_heatmap_density <- ggplot() +
geom_segment(data = pten_q_low , aes(x = x, y = y, xend = x, yend = 0), colour = "blue", size = 2) +
geom_segment(data = pten_q_mid, aes(x = x, y = y, xend = x, yend = 0, colour = x), size = 2) +
geom_segment(data = pten_q_high, aes(x = x, y = y, xend = x, yend = 0, colour = x), size = 2) +
geom_density(data = subset(pten_variants, class == "missense"), aes(x = score, y = ..scaled..), size = 1) +
scale_color_gradientn(colours = c("blue","white","red"), values = rescale(c(pten_lower_bound,1,2)), limits=c(pten_lower_bound,2)) +
scale_x_continuous(limits = c(-0.8, 2), breaks = seq(-0.4,2,0.4), expand = c(0,0)) +
scale_y_continuous(limits = c(0,1.2), breaks = c(0,0.4,0.8,1.2), expand = c(0,0.002)) + xlab("Variant abundance score") + ylab("Density") +
theme(panel.background = element_rect(fill = "white"), legend.position = "none") +
geom_density(data = subset(pten_variants, class == "synonymous"), aes(x = score, y = ..scaled..), color = "darkred", alpha = 0.4, size = 1, linetype = 2) +
geom_density(data = subset(pten_variants, class == "nonsense"), aes(x = score, y = ..scaled..), color = "darkblue", alpha = 0.4, size = 1, linetype = 2) +
geom_vline(xintercept = quantile(subset(pten_variants, class == "synonymous")$score,0.05), linetype = 2)
ggsave(file = "Output/PTEN_heatmap_density.pdf", pten_heatmap_density, height = 1.2, width = 3)
print(pten_heatmap_density)
Analagous heatmap and density plots for TPMT. See above PTEN chunks for details.
Here is the TPMT density plot, with the single-amino acid variant color gradient serving as the heatmap color legend.
tpmt_m <- ggplot(data = subset(tpmt_variants, class == "missense"), aes(x = score, y = ..scaled..))
tpmt_m <- tpmt_m + geom_density()
tpmt_q <- ggplot_build(tpmt_m)$data[[1]]
tpmt_q_low <- subset(tpmt_q, x <= tpmt_lower_bound)
tpmt_q_mid <- subset(tpmt_q, x > tpmt_lower_bound & x < 1)
tpmt_q_high <- subset(tpmt_q, x >= 1)
tpmt_heatmap_density <- ggplot() +
geom_segment(data = tpmt_q_low, aes(x = x, y = y, xend = x, yend = 0), colour = "blue", size = 2) +
geom_segment(data = tpmt_q_mid, aes(x = x, y = y, xend = x, yend = 0, colour = x), size = 2) +
geom_segment(data = tpmt_q_high, aes(x = x, y = y, xend = x, yend = 0, colour = x), size = 2) +
geom_density(data = subset(tpmt_variants, class == "missense"), aes(x = score, y = ..scaled..), size = 1) +
scale_color_gradientn(colours = c("blue","white","red"), values = rescale(c(tpmt_lower_bound,1,2)), limits=c(tpmt_lower_bound,2)) +
scale_x_continuous(limits = c(-0.8, 2), breaks = seq(-0.4,2,0.4), expand = c(0,0)) +
scale_y_continuous(limits = c(0,1.2), breaks = c(0,0.4,0.8,1.2), expand = c(0,0.002)) + xlab("Variant abundance score") + ylab("Density") +
theme(panel.background = element_rect(fill = "white"), legend.position = "none") +
geom_density(data = subset(tpmt_variants, class == "synonymous"), aes(x = score, y = ..scaled..), color = "darkred", alpha = 0.4, size = 1, linetype = 2) +
geom_density(data = subset(tpmt_variants, class == "nonsense"), aes(x = score, y = ..scaled..), color = "darkblue", alpha = 0.4, size = 1, linetype = 2) +
geom_vline(xintercept = quantile(subset(tpmt_variants, class == "synonymous")$score,0.05), linetype = 2)
ggsave(file = "Output/TPMT_heatmap_density.pdf", tpmt_heatmap_density, height = 1.2, width = 3)
tpmt_heatmap_density
The single amino acid variant distributions were compared between PTEN and TPMT, to see which protein had more single amino acid variants with scores outside of the range of the synonymous distribution.
pten_weighted_average_density_plot <- ggplot() + geom_vline(xintercept = subset(pten_variants, class == "wt")$median_w_ave, linetype = 3) + geom_density(data = subset(pten_variants, class == "synonymous"), aes(x = median_w_ave, y = ..scaled..), color = "red") + geom_density(data = subset(pten_variants, class == "nonsense" & position > 50 & position < 350), aes(x = median_w_ave, y = ..scaled..), color = "blue") + geom_density(data = subset(pten_variants, class == "missense"), aes(x = median_w_ave, y = ..scaled..), color = "black") + scale_x_continuous(limits = c(0.25, 1), expand = c(0,0)) + ylab("Density") + xlab("Median weighted average value")
ggsave(file = "Output/PTEN_weighted_ave_smoothed_histograms.pdf", pten_weighted_average_density_plot, height = 1, width = 2.5)
## Warning: Removed 192 rows containing non-finite values (stat_density).
## Warning: Removed 3527 rows containing non-finite values (stat_density).
pten_weighted_average_density_plot
## Warning: Removed 192 rows containing non-finite values (stat_density).
## Warning: Removed 3527 rows containing non-finite values (stat_density).
tpmt_weighted_average_density_plot <- ggplot() + geom_vline(xintercept = subset(tpmt_variants, class == "wt")$median_w_ave, linetype = 3) + geom_density(data = subset(tpmt_variants, class == "synonymous"), aes(x = median_w_ave, y = ..scaled..), color = "red") + geom_density(data = subset(tpmt_variants, class == "nonsense" & position > 50 & position < 200), aes(x = median_w_ave, y = ..scaled..), color = "blue") + geom_density(data = subset(tpmt_variants, class == "missense"), aes(x = median_w_ave, y = ..scaled..), color = "black") + scale_x_continuous(limits = c(0.25, 1), expand = c(0,0)) + ylab("Density") + xlab("Median weighted average value")
ggsave(file = "Output/TPMT_weighted_ave_smoothed_histograms.pdf", tpmt_weighted_average_density_plot, height = 1, width = 2.5)
## Warning: Removed 28 rows containing non-finite values (stat_density).
## Warning: Removed 962 rows containing non-finite values (stat_density).
tpmt_weighted_average_density_plot
## Warning: Removed 28 rows containing non-finite values (stat_density).
## Warning: Removed 962 rows containing non-finite values (stat_density).
combined_density_plot <- ggplot() + geom_density(data = subset(pten_variants, class == "missense" & !is.na(score)), aes(x = score, y = ..density../2), fill = "grey25", color = "black", alpha = 0.4) +
geom_density(data = subset(tpmt_variants, class == "missense" & !is.na(score)), aes(x = score, y = ..density../2), fill = "dark green", color = "black", alpha = 0.4) +
xlab("Abundance score") + ylab("Density") + geom_vline(xintercept = quantile(subset(pten_variants, class == "synonymous")$score,0.05), linetype = 2) + geom_vline(xintercept = quantile(subset(tpmt_variants, class == "synonymous")$score,0.05), linetype = 2, color = "dark green")
ggsave(file = "Output/Overlapping_missense_smoothed_histograms.pdf", combined_density_plot, height = 1.5, width = 2.5)
combined_density_plot
## [1] "43.1 percent of PTEN missense variants have abundance scores below the synonymous distribution"
## [1] "38.5 percent of TPMT missense variants have abundance scores below the synonymous distribution"
To complement the VAMP-seq data summary provided by the heatmap and density plots, I also made complementary plots showing the median score at each position (thus demonstrating tolerability of each position to substitution), and the PSIC score of evolutionary conservation of each position.
ma <- function(x,n=3){filter(x,rep((1/n),n), sides=2)}
pten_positions$score_moving_average <- ma(pten_positions$score)
PTEN_positional_line_graph <- ggplot() +
geom_point(data = subset(pten_positions, variants_scored >= 5), aes(x = position, y = score), size = 0.5, color = "black") +
geom_line(data = pten_positions, aes(x = position, y = score_moving_average), color = "black", size = 0.8, alpha = 0.5) +coord_flip() +
scale_x_reverse(expand = c(0,0.05), breaks = c(1, seq(10,400,10))) + xlab(NULL) + ylab(NULL) + scale_y_continuous(breaks = c(0.24,0.5,0.75,1)) +
theme(axis.text.x = element_text(angle = -90, vjust = 0.5))
ggsave(file = "Output/PTEN_positional_line_graph.pdf", PTEN_positional_line_graph, height = 6.1, width = 1)
PTEN_positional_line_graph
tpmt_positions$score_moving_average <- ma(tpmt_positions$score)
TPMT_positional_line_graph <- ggplot() +
geom_point(data = subset(tpmt_positions, variants_scored >= 5), aes(x = position, y = score), size = 0.5, color = "black") +
geom_line(data = tpmt_positions, aes(x = position, y = score_moving_average), color = "black", size = 0.8, alpha = 0.5) +coord_flip() +
scale_x_reverse(expand = c(0,0.05), breaks = c(1, seq(10,400,10))) + xlab(NULL) + ylab(NULL) +
theme(axis.text.x = element_text(angle = -90, vjust = 0.5))
ggsave(file = "Output/TPMT_positional_line_graph.pdf", TPMT_positional_line_graph, height = 3.6, width = 1)
TPMT_positional_line_graph
pten_positions$psic_moving_average <- ma(pten_positions$AA1_psic)
PTEN_psic_line_graph <- ggplot() +
geom_point(data = pten_positions, aes(x = position, y = AA1_psic), size = 0.5, color = "black") +
geom_line(data = pten_positions, aes(x = position, y = psic_moving_average), color = "black", size = 0.8, alpha = 0.5) +coord_flip() +
scale_x_reverse(expand = c(0,0.05), breaks = c(1, seq(10,400,10))) + xlab(NULL) + ylab(NULL) +
theme(axis.text.x = element_text(angle = -90, vjust = 0.5))
ggsave(file = "Output/PTEN_psic_line_graph.pdf", PTEN_psic_line_graph, height = 6.1, width = 1)
PTEN_psic_line_graph
tpmt_positions$psic_moving_average <- ma(tpmt_positions$AA1_psic)
TPMT_psic_line_graph <- ggplot() +
geom_point(data = tpmt_positions, aes(x = position, y = AA1_psic), size = 0.5, color = "black") +
geom_line(data = tpmt_positions, aes(x = position, y = psic_moving_average), color = "black", size = 0.8, alpha = 0.5) +coord_flip() +
scale_x_reverse(expand = c(0,0.05), breaks = c(1, seq(10,400,10))) + xlab(NULL) + ylab(NULL) +
theme(axis.text.x = element_text(angle = -90, vjust = 0.5))
ggsave(file = "Output/TPMT_psic_line_graph.pdf", TPMT_psic_line_graph, height = 3.6, width = 1)
TPMT_psic_line_graph
PTEN_positional_bar_graph <- ggplot() +
geom_bar(data = pten_positions, aes(x = position, y = variants_scored), color = NA, fill = "grey70", stat = "identity") +
coord_flip() +
scale_x_reverse(expand = c(0,0.05), breaks = c(1, seq(10,400,10))) + xlab(NULL) + ylab(NULL) + scale_y_continuous(breaks = c(0,5,10,15,19), limits = c(0,19), expand = c(0,0)) +
theme(axis.text.x = element_text(angle = -90, vjust = 0.5))
ggsave(file = "Output/PTEN_positional_bar_graph.pdf", PTEN_positional_bar_graph, height = 6.1, width = 0.8)
PTEN_positional_bar_graph
TPMT_positional_bar_graph <- ggplot() +
geom_bar(data = tpmt_positions, aes(x = position, y = variants_scored), color = NA, fill = "grey70", stat = "identity") +
coord_flip() +
scale_x_reverse(expand = c(0,0.05), breaks = c(1, seq(10,400,10))) + xlab(NULL) + ylab(NULL) + scale_y_continuous(breaks = c(0,5,10,15,19), limits = c(0,19), expand = c(0,0)) +
theme(axis.text.x = element_text(angle = -90, vjust = 0.5))
ggsave(file = "Output/TPMT_positional_bar_graph.pdf", TPMT_positional_bar_graph, height = 3.6, width = 0.8)
TPMT_positional_bar_graph
I also made simple line graphs that show the domain architectures for PTEN and TPMT.
PTEN_architecture_schematic <- ggplot() +
geom_rect(aes(xmin = 1, xmax = 15, ymin = 0, ymax = 1), color = NA, fill = "grey50", alpha = 0.5) +
geom_rect(aes(xmin = 15, xmax = 185, ymin = 0, ymax = 1), color = NA, fill = "#cee076", alpha = 0.5) +
geom_rect(aes(xmin = 185, xmax = 351, ymin = 0, ymax = 1), color = NA, fill = "#d28de8", alpha = 0.5) +
geom_rect(aes(xmin = 351, xmax = 400, ymin = 0, ymax = 1), color = NA, fill = "grey30", alpha = 0.5) +
geom_segment(aes(x = 1, y = 0, xend = 403), yend = 0, size = 2) +
scale_x_reverse(breaks = c(1,185,350), expand = c(0,0)) +
scale_y_continuous(breaks = NULL, expand = c(0,0)) + xlab(NULL) + ylab(NULL) + coord_flip()
ggsave(file = "Output/PTEN_architecture.pdf", PTEN_architecture_schematic, height = 6, width = 0.6)
PTEN_architecture_schematic
TPMT_architecture_schematic <- ggplot() +
geom_rect(aes(xmin = 1, xmax = 245, ymin = 0, ymax = 1), color = NA, fill = "grey50", alpha = 0.5) +
geom_segment(aes(x = 1, y = 0, xend = 245), yend = 0, size = 2) +
scale_x_reverse(breaks = c(1,245), expand = c(0,0)) +
scale_y_continuous(breaks = NULL, expand = c(0,0)) + xlab(NULL) + ylab(NULL) + coord_flip()
ggsave(file = "Output/TPMT_architecture.pdf", TPMT_architecture_schematic, height = 3.6, width = 0.6)
TPMT_architecture_schematic
And lastly, a line graph that shows the basic secondary structural elements observed in the pdb of PTEN and TPMT. Blue is alpha helices, and pink is beta-strands. Black are positions resolved in the crystal structure, while grey is missing positions.
PTEN_dssp_schematic <- ggplot() +
geom_segment(aes(x = 1, y = 0, xend = max(pten_positions$position)), yend = 0, size = 1, color = "grey70") +
geom_point(data = subset(pten_positions, !is.na(xca)), aes(x = position, y = 0), color = "black", size = 1.8) +
geom_point(data = subset(pten_positions, sheet == 1), aes(x = position, y = 0), color = "pink", size = 1.5) +
geom_point(data = subset(pten_positions, helix == 1), aes(x = position, y = 0), color = "cyan", size = 1.5) +
scale_x_reverse(breaks = c(1,185,350), expand = c(0,0)) +
scale_y_continuous(breaks = NULL, expand = c(0,0)) + xlab(NULL) + ylab(NULL) + coord_flip() +
theme(panel.border = element_blank(), axis.text.y = element_blank())
ggsave(file = "Output/PTEN_dssp_schematic.pdf", PTEN_dssp_schematic, height = 6, width = 0.4)
PTEN_dssp_schematic
tpmt_dssp_schematic <- ggplot() +
geom_segment(aes(x = 1, y = 0, xend = max(tpmt_positions$position)), yend = 0, size = 1, color = "grey70") +
geom_point(data = tpmt_positions, aes(x = position, y = 0), color = "black", size = 1.8) +
geom_point(data = subset(tpmt_positions, sheet == 1), aes(x = position, y = 0), color = "pink", size = 1.5) +
geom_point(data = subset(tpmt_positions, helix == 1), aes(x = position, y = 0), color = "cyan", size = 1.5) +
scale_x_reverse(breaks = c(1,100,200,245), expand = c(0,0)) +
scale_y_continuous(breaks = NULL, expand = c(0,0)) + xlab(NULL) + ylab(NULL) + coord_flip() +
theme(panel.border = element_blank(), axis.text.y = element_blank())
ggsave(file = "Output/TPMT_dssp_schematic.pdf", tpmt_dssp_schematic, height = 3.6, width = 0.3)
tpmt_dssp_schematic
To validate the accuracy of the PTEN VAMP-seq data, I took two dozen variants that spanned the range of scores observed in VAMP-seq, engineered the mutations in a plasmid containing EGFP-tagged PTEN, recombined the cells, and assessed the EGFP/mCherry ratios in the recombined cells by flow cytometry.
pten_control_lm <- lm(formula = score ~ egfp_geomean_log10, data = pten_variants)
PTEN_control_scatterplot <- ggplot() + geom_abline(slope = pten_control_lm$coefficients[[2]], intercept = pten_control_lm$coefficients[[1]], linetype = 2, color = "grey50") +
geom_point(data = pten_variants, aes(x = egfp_geomean_log10, y = score)) +
annotate("text", x = pten_variants$egfp_geomean_log10, y = pten_variants$score + 0.05, label = pten_variants$variant, color = "red", size = 2) +
xlab("Log10 normalized score (wt = 0)\nof EGFP/mCherry geomean") + ylab("Abundance score") +
annotate("text", x = -1.8, y = pten_variants[pten_variants$variant == "C124S","score"] + 0.1, label = paste("n: ", nrow(subset(pten_variants, !is.na(egfp_geomean_log10) & !is.na(score))), sep = ""), parse = TRUE) +
annotate("text", x = -1.8, y = pten_variants[pten_variants$variant == "C124S","score"], label = paste("r: ", round(cor(pten_variants$score, pten_variants$egfp_geomean_log10, use="pairwise.complete.obs", method = "pearson"),2), sep = ""), parse = TRUE) +
annotate("text", x = -1.8, y = pten_variants[pten_variants$variant == "C124S","score"] - 0.1, label = paste("rho:", round(cor(pten_variants$score, pten_variants$egfp_geomean_log10, use="pairwise.complete.obs", method = "spearman"),2), sep = ""), parse = TRUE) +
scale_y_continuous(expand = c(0,0))
PTEN_control_scatterplot
ggsave(file = "Output/PTEN_validation_individual.pdf", plot = PTEN_control_scatterplot, height =3, width = 3)
## [1] "PTEN individual validation correlation, n: 25"
## [1] "PTEN individual validation correlation, Pearson's R: 0.96"
## [1] "PTEN individual validation correlation, Spearman's rho: 0.96"
A similar analysis testing individual variants of TPMT.
tpmt_variants[tpmt_variants$variant == "WT","single_WTNorm"] <- 1
tpmt_subset <- subset(tpmt_variants, !is.na(single_WTNorm) & !is.na(score))
tpmt_subset$egfp_geomean_log10 <- log10(tpmt_subset$single_WTNorm)
tpmt_control_lm <- lm(formula = score ~ egfp_geomean_log10, data = tpmt_subset)
paste("TPMT individual validation correlation, n:", nrow(subset(tpmt_subset, !is.na(score) & !is.na(egfp_geomean_log10))))
## [1] "TPMT individual validation correlation, n: 19"
paste("TPMT individual validation correlation, Pearson's R:", round(cor(tpmt_subset$score, tpmt_subset$egfp_geomean_log10, use="pairwise.complete.obs", method = "pearson"),2))
## [1] "TPMT individual validation correlation, Pearson's R: 0.75"
paste("TPMT individual validation correlation, Spearman's rho:", round(cor(tpmt_subset$score, tpmt_subset$egfp_geomean_log10, use="pairwise.complete.obs", method = "spearman"),2))
## [1] "TPMT individual validation correlation, Spearman's rho: 0.61"
TPMT_control_scatterplot <- ggplot() + geom_abline(slope = tpmt_control_lm$coefficients[[2]], intercept = tpmt_control_lm$coefficients[[1]], linetype = 2, color = "grey50") +
geom_point(data = tpmt_subset, aes(x = egfp_geomean_log10, y = score)) +
annotate("text", x = tpmt_subset$egfp_geomean_log10, y = tpmt_subset$score + 0.03, label = tpmt_subset$variant, color = "red", size = 2) +
xlab("Log10 normalized score (wt = 0)\nof EGFP/mCherry geomean") + ylab("Abundance score") +
annotate("text", x = min(tpmt_subset$egfp_geomean_log10) + 0.12, y = max(tpmt_subset$score), label = paste("n:", nrow(tpmt_subset), sep = ""), parse = TRUE) +
annotate("text", x = min(tpmt_subset$egfp_geomean_log10) + 0.12, y = max(tpmt_subset$score) - 0.1, label = paste("r:", round(sqrt(summary(tpmt_control_lm)$r.squared),2), sep = ""), parse = TRUE) +
annotate("text", x = min(tpmt_subset$egfp_geomean_log10) + 0.12, y = max(tpmt_subset$score) - 0.2, label = paste("rho:", round(cor(tpmt_subset$score, tpmt_subset$egfp_geomean_log10, use="pairwise.complete.obs", method = "spearman"),2), sep = ""), parse = TRUE) +
scale_y_continuous(expand = c(0,0.1)) + scale_x_continuous(expand = c(0,0.15))
TPMT_control_scatterplot
ggsave(file = "Output/TPMT_validation_individual.pdf", plot = TPMT_control_scatterplot, height =3, width = 3)
As another test, PTEN VAMP-seq scores were compared with abundance phenotypes observed in western blots from published papers. See supplementary table 10 for a list of variant abundance phenotypes, along with the corresponding PMID of the report.
pten_lit_destabilized_subset <- subset(pten_variants, !is.na(pten_variants$lit_destabilized))
paste("There are",nrow(pten_lit_destabilized_subset),"PTEN variants being compared with abundance phenotypes in the literature")
## [1] "There are 41 PTEN variants being compared with abundance phenotypes in the literature"
for(x in 1:nrow(pten_lit_destabilized_subset)){
if(pten_lit_destabilized_subset$lit_destabilized[x] == "no"){pten_lit_destabilized_subset$lit_destabilized[x] <- "WT-like in reports"}
if(pten_lit_destabilized_subset$lit_destabilized[x] == "conflict"){pten_lit_destabilized_subset$lit_destabilized[x] <- "Conflicting in reports"}
if(pten_lit_destabilized_subset$lit_destabilized[x] == "yes"){pten_lit_destabilized_subset$lit_destabilized[x] <- "Less present in reports"}
}
pten_lit_destabilized_subset$lit_destabilized <- factor(pten_lit_destabilized_subset$lit_destabilized, levels = c("WT-like in reports","Conflicting in reports","Less present in reports"))
pten_lit_destabilized_plot <- ggplot() +
geom_jitter(data = pten_lit_destabilized_subset, aes(x = pten_lit_destabilized_subset$lit_destabilized, y = score, color = pten_lit_destabilized_subset$lit_destabilized),size = 3, alpha = 0.5, width = 0.4) +
xlab("Western blot phenotypes in literature") + ylab("Stability score from expts") +
theme(legend.position = "none", axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) + geom_hline(yintercept = pten_synonymous_5th, linetype = 2)
pten_lit_destabilized_plot
ggsave(file = "Output/PTEN_literature_stability_plot.pdf", pten_lit_destabilized_plot, height = 3.5, width = 2)
## Run a ratio permutation test to understand significance
lit_destabilized_actual_ratio <- mean(subset(pten_lit_destabilized_subset, lit_destabilized == "Less present in reports")$score, na.rm = TRUE) /
mean(subset(pten_lit_destabilized_subset, lit_destabilized == "WT-like in reports")$score, na.rm = TRUE)
iterations <- 100
lit_destabilized_starting_vector <- as.character(pten_lit_destabilized_subset$lit_destabilized)
lit_stabilized_working_matrix <- data.frame(matrix(NA,ncol = 1, nrow = length(lit_destabilized_starting_vector)))
lit_destabilized_ratio_matrix <- data.frame(matrix(NA,ncol = iterations, nrow = 1))
for(y in 1:iterations){
for(x in 1:length(lit_destabilized_starting_vector)){
lit_stabilized_working_matrix[x,1] <- sample(lit_destabilized_starting_vector, 1, replace = FALSE)
}
lit_stabilized_working_matrix <- cbind(pten_lit_destabilized_subset[,c("variant","score")], lit_stabilized_working_matrix)
colnames(lit_stabilized_working_matrix) <- c("variant","score","class")
lit_destabilized_ratio_matrix[1,y] <- mean(subset(lit_stabilized_working_matrix, class == "Less present in reports")$score, na.rm = TRUE) /
mean(subset(lit_stabilized_working_matrix, class == "WT-like in reports")$score, na.rm = TRUE)
}
paste(sum(lit_destabilized_ratio_matrix < lit_destabilized_actual_ratio),"of",iterations,"permutations were smaller")
## [1] "0 of 100 permutations were smaller"
paste("The p value by the permutation test is less than ", round((sum(lit_destabilized_ratio_matrix < lit_destabilized_actual_ratio) + 1) / (iterations + 1), 5), sep = "")
## [1] "The p value by the permutation test is less than 0.0099"
A similar analysis with TPMT. Please see supplementary table 10 to find a list of variants with WT-normalized published abundance phenotype, with the corresponding PMID of the report.
tpmt_lit_destabilized_subset <- subset(tpmt_variants, !is.na(tpmt_variants$published_western) & (tpmt_variants$published_western != "NaN") & !is.na(score))
paste("There are",nrow(tpmt_lit_destabilized_subset),"TPMT variants being compared with abundance phenotypes in the literature")
## [1] "There are 20 TPMT variants being compared with abundance phenotypes in the literature"
tpmt_lit_destabilized_plot <- ggplot() + #geom_boxplot(data = tpmt_lit_destabilized_subset, aes(x = tpmt_lit_destabilized_subset$published_western, y = score)) +
geom_point(data = tpmt_lit_destabilized_subset, aes(x = tpmt_lit_destabilized_subset$published_western, y = score), size = 3, alpha = 0.5) +
xlab("Described stability in the literature\n(western blots)") + ylab("Stability score from expts") +
theme(legend.position = "none") + geom_hline(yintercept = 0.5, linetype = 2)
tpmt_lit_destabilized_plot
ggsave(file = "Output/TPMT_literature_stability_plot.pdf", tpmt_lit_destabilized_plot, height = 2.5, width = 2.5)
This chunk of code compares how PTEN VAMP-seq abundance scores correlate with melting temperatures tested with purified PTEN protein in vitro (PMID: 25647146).
pten_melting_temp_subset <- subset(pten_variants, !is.na(pten_variants$tm))
pten_melting_temp_lm <- lm(formula = tm ~ score, data = pten_melting_temp_subset)
melting_plot <- ggplot(data = pten_melting_temp_subset, aes(x = tm, y = score)) + geom_smooth(method = "lm", se = FALSE, color = "grey75", linetype = 2) +
geom_point() + annotate("text", x = pten_melting_temp_subset$tm, y = pten_melting_temp_subset$score + 0.05, label = pten_melting_temp_subset$variant, color = "red") +
ylab("Abundance score\n") + xlab("\nMelting temperature (Johnston et al, 2015)") + scale_x_continuous(expand= c(0.1,0.1)) +
annotate("text", x = min(pten_melting_temp_subset$tm), y = pten_variants[pten_variants$variant == "C124S","score"], label = paste("r:", round(sqrt(summary(pten_melting_temp_lm)$r.squared),2)), hjust = 0, parse = TRUE) +
annotate("text", x = min(pten_melting_temp_subset$tm), y = pten_variants[pten_variants$variant == "C124S","score"] - 0.1, label = paste("rho:", round(cor(pten_melting_temp_subset$score, pten_melting_temp_subset$tm, use="pairwise.complete.obs", method = "spearman"),2)), hjust = 0, parse = TRUE)
ggsave(file = "Output/PTEN_melting_temp_plot.pdf", melting_plot, height = 3, width = 3)
melting_plot
paste("PTEN melting temperature and abundance score correlation, Pearson's R:", round(cor(pten_melting_temp_subset$score, pten_melting_temp_subset$tm, use="pairwise.complete.obs", method = "pearson"),2))
## [1] "PTEN melting temperature and abundance score correlation, Pearson's R: 0.97"
paste("PTEN melting temperature and abundance score correlation, Spearman's rho:", round(cor(pten_melting_temp_subset$score, pten_melting_temp_subset$tm, use="pairwise.complete.obs", method = "spearman"),2))
## [1] "PTEN melting temperature and abundance score correlation, Spearman's rho: 0.9"
We looked for some basic patterns of similarity or difference in the tolerability of PTEN and TPMT to single amino acid subsitutions. Mutation of WT hydrophobic aromatic, methionine, or long nonpolar aliphatic amino acids produced the largest decreases in abundance for both proteins.
pten_residue_median <- ddply(subset(pten_variants, class == "missense" & !(is.na(score)))[,c("start","score")], "start", numcolwise(median))
tpmt_residue_median <- ddply(subset(tpmt_variants, class == "missense" & !(is.na(score)))[,c("start","score")], "start", numcolwise(median))
start_residue_median <- merge(pten_residue_median, tpmt_residue_median, by = "start")
colnames(start_residue_median) <- c("start","pten_residue_median","tpmt_residue_median")
starting_residue_median_scatterplot <- ggplot() + geom_point(data = start_residue_median, aes(x = pten_residue_median, y = tpmt_residue_median), alpha = 0.4) +
annotate("text", x = start_residue_median$pten_residue_median, y = start_residue_median$tpmt_residue_median + 0.015, label = start_residue_median$start) +
xlab("PTEN residue") + ylab("TPMT residue") + scale_y_continuous(limits = c(0.5, 1), breaks = seq(0.5, 1, 0.1)) + scale_x_continuous(limits = c(0.2, 1), breaks = seq(0.2, 1, 0.1)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
ggsave(file = "Output/Starting_residue_median_scatterplot.pdf", starting_residue_median_scatterplot, height = 2.5, width = 1.75)
starting_residue_median_scatterplot
pten_residue_median <- ddply(subset(pten_variants, class == "missense" & !(is.na(score)))[,c("end","score")], "end", numcolwise(median))
tpmt_residue_median <- ddply(subset(tpmt_variants, class == "missense" & !(is.na(score)))[,c("end","score")], "end", numcolwise(median))
end_residue_median <- merge(pten_residue_median, tpmt_residue_median, by = "end")
colnames(end_residue_median) <- c("end","pten_residue_median","tpmt_residue_median")
ending_residue_median_scatterplot <- ggplot() + geom_point(data = end_residue_median, aes(x = pten_residue_median, y = tpmt_residue_median), alpha = 0.4) +
annotate("text", x = end_residue_median$pten_residue_median, y = end_residue_median$tpmt_residue_median + 0.015, label = end_residue_median$end) +
xlab("PTEN residue") + ylab("TPMT residue") + scale_y_continuous(limits = c(0.58, 0.95), breaks = seq(0.6, 1, 0.1)) + scale_x_continuous(limits = c(0.5, 0.85), breaks = seq(0.5, 1, 0.1)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
ggsave(file = "Output/Ending_residue_median_scatterplot.pdf", ending_residue_median_scatterplot, height = 2.5, width = 1.75)
ending_residue_median_scatterplot
We then determined which features correlated with abundance score for the two different proteins using Spearman’s correlation coefficients. In fact, WT amino acid hydrophobicity was negatively correlated with abundance score (Fig. 3b, WT hydroΦ), whereas mutant amino acid hydrophobicity was positively correlated with abundance score (MT hydroΦ). Conversely, mutations of WT amino acids with high relative solvent accessibility (RSA), polarity (WT Polarity), and crystal-structure temperature factor (B-factor), all features associated with polar residues present on the protein surface, were associated with high abundance scores.
pten_variants_min <- subset(pten_variants, !is.na(score) & !is.na(rsa) & !is.na(evolutionary_coupling_avg) & class == "missense")
specific_features <- c("variant","class","score","rsa","abs_tco","kappa","alpha","phi","psi","bfactor","grantham","hydro1","hydro2","hydrodiff","vol1","vol2","voldiff","polarity1","polarity2","polaritydiff","weight1","weight2","weightdiff","crowding","AA1_PI","AA2_PI","deltaPI","AA1_psic","AA2_psic","delta_psic","evolutionary_coupling_avg")
pten_input_output <- pten_variants_min[,append(c("variant","score"),specific_features)]
pten_spearman_frame <- data.frame(matrix(nrow = length(specific_features) - 5, ncol = 2, "NA"))
colnames(pten_spearman_frame) <- c("feature","spearman_r")
pten_spearman_frame$feature <- as.character(pten_spearman_frame$feature)
pten_spearman_frame$spearman_r <- as.numeric(pten_spearman_frame$spearman_r)
y = 1
for(x in 4:length(specific_features)){
pten_spearman_frame[y,"feature"] <- specific_features[x]
pten_spearman_frame[y,"spearman_r"] <- cor(pten_input_output[,specific_features[x]], pten_input_output$score, method = "spearman")
y = y + 1
}
pten_spearman_frame$abs_spearman_r <- abs(as.numeric(pten_spearman_frame$spearman_r))
pten_spearman_frame <- pten_spearman_frame[order(-pten_spearman_frame$abs_spearman_r),]
tpmt_variants_min <- subset(tpmt_variants, !is.na(score) & !is.na(rsa) & !is.na(evolutionary_coupling_avg) & class == "missense")
specific_features <- c("variant","class","score","rsa","abs_tco","kappa","alpha","phi","psi","bfactor","grantham","hydro1","hydro2","hydrodiff","vol1","vol2","voldiff","polarity1","polarity2","polaritydiff","weight1","weight2","weightdiff","crowding","AA1_PI","AA2_PI","deltaPI","AA1_psic","AA2_psic","delta_psic","evolutionary_coupling_avg")
tpmt_input_output <- tpmt_variants_min[,append(c("variant","score"),specific_features)]
tpmt_spearman_frame <- data.frame(matrix(nrow = length(specific_features) - 5, ncol = 2, "NA"))
colnames(tpmt_spearman_frame) <- c("feature","spearman_r")
tpmt_spearman_frame$feature <- as.character(tpmt_spearman_frame$feature)
tpmt_spearman_frame$spearman_r <- as.numeric(tpmt_spearman_frame$spearman_r)
y = 1
for(x in 4:length(specific_features)){
tpmt_spearman_frame[y,"feature"] <- specific_features[x]
tpmt_spearman_frame[y,"spearman_r"] <- cor(tpmt_input_output[,specific_features[x]], tpmt_input_output$score, method = "spearman")
y = y + 1
}
tpmt_spearman_frame$abs_spearman_r <- abs(as.numeric(tpmt_spearman_frame$spearman_r))
tpmt_spearman_frame <- tpmt_spearman_frame[order(-tpmt_spearman_frame$abs_spearman_r),]
spearman_frame_merged <- merge(pten_spearman_frame[c("feature","spearman_r")], tpmt_spearman_frame[c("feature","spearman_r")], by = "feature")
colnames(spearman_frame_merged) <- c("feature","pten_spearman_r","tpmt_spearman_r")
spearman_frame_merged$mean = (spearman_frame_merged$pten_spearman_r + spearman_frame_merged$tpmt_spearman_r)/2
spearman_frame_merged$abs_spearman_r <- abs(as.numeric(spearman_frame_merged$mean))
spearman_frame_merged <- spearman_frame_merged[order(-spearman_frame_merged$abs_spearman_r),]
evolutionary_sequence <- c("AA1_psic","AA2_psic","delta_psic","evolutionary_coupling_avg")
primary_sequence <- c("hydro1","hydro2","hydrodiff","vol1","vol2","grantham","AA1_PI","AA2_PI","polarity1","polarity2","polaritydiff")
structural_sequence <- c("rsa","bfactor","crowding","abs_tco","substr_dist","alpha","phi","psi")
plotlabel_mergeframe <- subset(spearman_frame_merged, feature %in% c("rsa","bfactor","hydro2","evolutionary_coupling_avg","substr_dist","delta_psic","abs_tco","grantham","polarity1","polaritydiff","AA1_psic","AA2_psic","hydro1","hydrodiff","crowding"))
Both_spearman_features <- ggplot() + geom_hline(yintercept = 0) + geom_vline(xintercept = 0) +
geom_point(data = subset(spearman_frame_merged, feature %in% evolutionary_sequence), aes(x = pten_spearman_r, y = tpmt_spearman_r), color = "red") +
geom_point(data = subset(spearman_frame_merged, feature %in% primary_sequence), aes(x = pten_spearman_r, y = tpmt_spearman_r), color = "cyan") +
geom_point(data = subset(spearman_frame_merged, feature %in% structural_sequence), aes(x = pten_spearman_r, y = tpmt_spearman_r), color = "blue") +
geom_text(data = plotlabel_mergeframe, aes(x = pten_spearman_r, y = tpmt_spearman_r + 0.01), label = plotlabel_mergeframe$feature, size = 2.5)
ggsave(file = "Output/Both_spearman_features.pdf", Both_spearman_features, height = 3, width = 3)
Both_spearman_features
Looking at how the PTEN and TPMT positional median scores correlate with secondary structure and PSIC measurements of evolutionary conservation.
## [1] "Spearman's rho for PTEN AA1_psic and score: -0.26"
## [1] "Spearman's rho for TPMT AA1_psic and score: -0.59"
pten_secondary_structure_score_plot <- ggplot() + geom_jitter(data = subset(pten_positions, helix != 1 & sheet != 1 & !is.na(xca)), aes(x = "3. No secondary structure", y = score), alpha = 0.2) +
geom_crossbar(aes(x = "3. No secondary structure", y = median(subset(pten_positions, helix != 1 & sheet != 1 & !is.na(xca))$score, na.rm = TRUE), ymin = median(subset(pten_positions, helix != 1 & sheet != 1 & !is.na(xca))$score, na.rm = TRUE), ymax = median(subset(pten_positions, helix != 1 & sheet != 1 & !is.na(xca))$score, na.rm = TRUE)), color = "red") +
geom_jitter(data = subset(pten_positions, helix == 1), aes(x = "1. Alpha helix", y = score), alpha = 0.2) +
geom_crossbar(aes(x = "1. Alpha helix", y = median(subset(pten_positions, helix == 1)$score, na.rm = TRUE), ymin = median(subset(pten_positions, helix == 1)$score, na.rm = TRUE), ymax = median(subset(pten_positions, helix == 1)$score, na.rm = TRUE)), color = "red") +
geom_jitter(data = subset(pten_positions, sheet == 1), aes(x = "2. Beta sheet", y = score), alpha = 0.2) +
geom_crossbar(aes(x = "2. Beta sheet", y = median(subset(pten_positions, sheet == 1)$score, na.rm = TRUE), ymin = median(subset(pten_positions, sheet == 1)$score, na.rm = TRUE), ymax = median(subset(pten_positions, sheet == 1)$score, na.rm = TRUE)), color = "red") +
geom_jitter(data = subset(pten_positions, is.na(xca)), aes(x = "4. Unresolved in structure", y = score), alpha = 0.2) +
geom_crossbar(aes(x = "4. Unresolved in structure", y = median(subset(pten_positions, is.na(xca))$score, na.rm = TRUE), ymin = median(subset(pten_positions, is.na(xca))$score, na.rm = TRUE), ymax = median(subset(pten_positions, is.na(xca))$score, na.rm = TRUE)), color = "red") +
xlab(NULL) + ylab("PTEN abundance score") + theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), axis.line = element_line(colour = "black")) + scale_y_continuous(breaks = seq(0,1.2,0.2)) +
annotate("text", x = "1. Alpha helix", y = 1.4, label = paste("n:",nrow(subset(pten_positions, helix == 1)))) +
annotate("text", x = "2. Beta sheet", y = 1.4, label = paste("n:",nrow(subset(pten_positions, sheet == 1)))) +
annotate("text", x = "3. No secondary structure", y = 1.4, label = paste("n:",nrow(subset(pten_positions, helix != 1 & sheet != 1 & !is.na(xca))))) +
annotate("text", x = "4. Unresolved in structure", y = 1.4, label = paste("n:",nrow(subset(pten_positions, is.na(xca)))))
ggsave("Output/PTEN_secondary_structure_score_plot.pdf", pten_secondary_structure_score_plot, height = 3.5, width = 4)
pten_secondary_structure_score_plot
tpmt_secondary_structure_score_plot <- ggplot() + geom_jitter(data = subset(tpmt_positions, helix != 1 & sheet != 1), aes(x = "3. No secondary structure", y = score), alpha = 0.2) +
geom_crossbar(aes(x = "3. No secondary structure", y = median(subset(tpmt_positions, helix != 1 & sheet != 1)$score, na.rm = TRUE), ymin = median(subset(tpmt_positions, helix != 1 & sheet != 1)$score, na.rm = TRUE), ymax = median(subset(tpmt_positions, helix != 1 & sheet != 1)$score, na.rm = TRUE)), color = "red") +
geom_jitter(data = subset(tpmt_positions, helix == 1), aes(x = "1. Alpha helix", y = score), alpha = 0.2) +
geom_crossbar(aes(x = "1. Alpha helix", y = median(subset(tpmt_positions, helix == 1)$score, na.rm = TRUE), ymin = median(subset(tpmt_positions, helix == 1)$score, na.rm = TRUE), ymax = median(subset(tpmt_positions, helix == 1)$score, na.rm = TRUE)), color = "red") +
geom_jitter(data = subset(tpmt_positions, sheet == 1), aes(x = "2. Beta sheet", y = score), alpha = 0.2) +
geom_crossbar(aes(x = "2. Beta sheet", y = median(subset(tpmt_positions, sheet == 1)$score, na.rm = TRUE), ymin = median(subset(tpmt_positions, sheet == 1)$score, na.rm = TRUE), ymax = median(subset(tpmt_positions, sheet == 1)$score, na.rm = TRUE)), color = "red") +
geom_jitter(data = subset(tpmt_positions, is.na(xca)), aes(x = "4. Unresolved in structure", y = score), alpha = 0.2) +
geom_crossbar(aes(x = "4. Unresolved in structure", y = median(subset(tpmt_positions, is.na(xca))$score, na.rm = TRUE), ymin = median(subset(tpmt_positions, is.na(xca))$score, na.rm = TRUE), ymax = median(subset(tpmt_positions, is.na(xca))$score, na.rm = TRUE)), color = "red") +
xlab(NULL) + ylab("TPMT abundance score") + scale_y_continuous(breaks = seq(0,1.2,0.2)) + theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), axis.line = element_line(colour = "black")) + scale_y_continuous(breaks = seq(0,1.2,0.2)) +
annotate("text", x = "1. Alpha helix", y = 1.4, label = paste("n:",nrow(subset(tpmt_positions, helix == 1)))) +
annotate("text", x = "2. Beta sheet", y = 1.4, label = paste("n:",nrow(subset(tpmt_positions, sheet == 1)))) +
annotate("text", x = "3. No secondary structure", y = 1.4, label = paste("n:",nrow(subset(tpmt_positions, helix != 1 & sheet != 1 & !is.na(xca))))) +
annotate("text", x = "4. Unresolved in structure", y = 1.4, label = paste("n:",nrow(subset(tpmt_positions, is.na(xca)))))
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
ggsave("Output/TPMT_secondary_structure_score_plot.pdf", tpmt_secondary_structure_score_plot, height = 3.5, width = 4)
tpmt_secondary_structure_score_plot
This chunk of code imports the pdb for PTEN (1d5r), removes the water molecules, and sets the basic visual motifs. Lastly, it creates a standard vantage point used in all of the subsequent .pml files for PTEN.
A custom coloring dataframe was used to get the colorscales used in this analysis script and display them in the pymol representations of PTEN.
To visualize the 20% of positions least tolerant to substitution, these positions were dispalyed as a semi-transparent surface representation.
# Make a new vector for the 20% lowest positional scores
pymol_lowest_scores <- c()
# Make a command that creates a new object of only the 20 % most destabilization-prone positions
pymol_lowest_scores <- append(pymol_lowest_scores, paste("create lowest_20, pten and resi ",gsub(", ","+",toString(subset(pten_positions, score < quantile(pten_positions$score, 0.2, na.rm = TRUE))$position)),"", sep= ""))
# Make the destabilized positions a semi-transparent surface representation
pymol_lowest_scores <- append(pymol_lowest_scores, c("show surface, lowest_20","set transparency, 0.5, lowest_20","orient pten"))
fileConn<-file("Output/Pymol_1A_PTEN_lowest_scores.pml")
writeLines(c(pymol_setup,pymol_color_setup_vector,pymol_color_score_vector,pymol_lowest_scores,pymol_pten_standard_view,
paste("png ",markdown_directory,"/Output/Pymol_1A_PTEN_lowest_scores.png, width=10cm, dpi=300, ray=1",sep="")
), fileConn)
close(fileConn)
First, I am selecting for PTEN positions that were expected to form hydrogen bonds or salt bridges. Here are some stats on how many of these positions existed in the PTEN crystal structure, as well as how many had (positional) low abundance scores, and shows these score distributions of these two classes of positions on a plot.
## [1] "There are 76 positions potentially involved in h-bond or salt bridge interactions"
## [1] "There are 26 positions potentially involved in h-bond or salt bridge interactions that are intolerant to mutation"
schbond_saltbridge_pten_low_abundance_positions <- subset(pten_positions, (saltbridge_unique_partners >= 1 | schbond_unique_partners >= 1) & abundance_class == "intolerant")
schbond_saltbridge_pten_not_low_abundance_positions <- subset(pten_positions, (saltbridge_unique_partners >= 1 | schbond_unique_partners >= 1) & (abundance_class != "intolerant" | is.na(abundance_class)))
pten_saltbridge_hbond_score_plot <- ggplot() +
geom_jitter(data = schbond_saltbridge_pten_low_abundance_positions, aes(x = "1. Substitution\nintolerant", y = score), alpha = 0.2) +
geom_crossbar(aes(x = "1. Substitution\nintolerant", y = median(schbond_saltbridge_pten_low_abundance_positions$score, na.rm = TRUE), ymin = median(schbond_saltbridge_pten_low_abundance_positions$score, na.rm = TRUE), ymax = median(schbond_saltbridge_pten_low_abundance_positions$score, na.rm = TRUE)), color = "red") +
geom_jitter(data = schbond_saltbridge_pten_not_low_abundance_positions, aes(x = "2. Remaining", y = score), alpha = 0.2) +
geom_crossbar(aes(x = "2. Remaining", y = median(schbond_saltbridge_pten_not_low_abundance_positions$score, na.rm = TRUE), ymin = median(schbond_saltbridge_pten_not_low_abundance_positions$score, na.rm = TRUE), ymax = median(schbond_saltbridge_pten_not_low_abundance_positions$score, na.rm = TRUE)), color = "red") + annotate("text", x = "1. Substitution\nintolerant", y = 1.3, label = paste("n:",nrow(schbond_saltbridge_pten_low_abundance_positions))) +
annotate("text", x = "2. Remaining", y = 1.3, label = paste("n:",nrow(schbond_saltbridge_pten_not_low_abundance_positions))) +
xlab(NULL) + ylab("PTEN abundance score") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
ggsave("Output/PTEN_saltbridge_hbond_score_plot.pdf", pten_saltbridge_hbond_score_plot, height = 2.5, width = 1.2)
## Warning: Removed 8 rows containing missing values (geom_point).
pten_saltbridge_hbond_score_plot
## Warning: Removed 8 rows containing missing values (geom_point).
This chunk of code performs heirarchical clustering of these substitution intolerant positions containing potential hydrogen bonds and salt bridges, so find regions of the PTEN protein that may be particularly depending on these polar interactions for folding. The frequencies that these positions are observed in COSMIC are also shown as a separate plot.
schbond_saltbridge_pten_positions <- subset(pten_positions, (saltbridge_unique_partners >= 1 | schbond_unique_partners >= 1) & abundance_class == "intolerant")
schbond_saltbridge_positions <- schbond_saltbridge_pten_positions[,c("sc_xca","sc_yca","sc_zca")]
rownames(schbond_saltbridge_positions) <- schbond_saltbridge_pten_positions$position
schbond_saltbridge_frame <- data.frame(matrix(ncol = nrow(schbond_saltbridge_pten_positions), nrow = nrow(schbond_saltbridge_pten_positions)))
colnames(schbond_saltbridge_frame) <- schbond_saltbridge_pten_positions$position; rownames(schbond_saltbridge_frame) <- schbond_saltbridge_pten_positions$position
schbond_saltbridge_frame2 <- dist(schbond_saltbridge_positions, method = "euclidean")
fit <- hclust(schbond_saltbridge_frame2, method="ward.D2")
dhc <- as.dendrogram(fit)
ddata <- dendro_data(dhc, type = "rectangle")
schbond_saltbridge_frame2_cut <- cutree(fit, h = 11)
schbond_saltbridge_frame2_cut <- data.frame(schbond_saltbridge_frame2_cut)
schbond_saltbridge_frame2_cut$label <- rownames(schbond_saltbridge_frame2_cut)
colnames(schbond_saltbridge_frame2_cut) <- c("group","label")
ddata_labels <- data.frame(ddata$labels)
ddata_labels <- merge(ddata_labels, schbond_saltbridge_frame2_cut, by = "label")
ddata_labels$group <- as.integer(ddata_labels$group)
ddata_labels$label <- as.character(ddata_labels$label)
ddata_labels$color_group <- 0
color_group_counter = 1
for(group_number in sort(unique(as.integer(ddata_labels$group)))){
query_group <- pten_positions[pten_positions$position %in% ddata_labels[ddata_labels$group == group_number,"label"],c("sc_xca","sc_yca","sc_zca")]
if(nrow(query_group) > 1){
query_matrix <- matrix(NA, nrow = nrow(query_group), ncol = nrow(query_group))
colnames(query_matrix) <- ddata_labels[ddata_labels$group == group_number,"label"]
rownames(query_matrix) <- ddata_labels[ddata_labels$group == group_number,"label"]
for(x in 1:nrow(query_group)){
for(y in 1:nrow(query_group)){
query_matrix[x,y] <- sqrt((query_group[x,"sc_xca"] - query_group[y,"sc_xca"])^2 + (query_group[x,"sc_yca"] - query_group[y,"sc_yca"])^2 + (query_group[x,"sc_zca"] - query_group[y,"sc_zca"])^2)
}
}
if(max(query_matrix) < 11){ddata_labels[ddata_labels$group == group_number,"color_group"] <- color_group_counter; color_group_counter = color_group_counter + 1}
}
}
ddata_labels$color_group_factor <- as.factor(ddata_labels$color_group)
sc_interaction_color_groups <- c("0"="grey50","1"="salmon","2"="brown","3"="limegreen","4"="violet","5"="orange","6"="yellow","7"="pink","8"="darkgreen")
Destabilizing_hbond_saltbridge_tree <- ggplot(segment(ddata)) +
geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_text(data = ddata_labels, aes(x = x, y = y-0.2, label = label, hjust = 0, vjust = 0.5, angle = -90, colour = color_group_factor)) +
geom_hline(yintercept = 11, linetype = 2) + xlab(NULL) + ylab(NULL) + scale_x_continuous(breaks = NULL) + scale_y_continuous(expand = c(0,15)) + theme(axis.text.x = element_blank()) + scale_colour_manual(values = sc_interaction_color_groups)
ggsave(file = "Output/PTEN_destabilizing_hbond_saltbridge_tree.pdf", Destabilizing_hbond_saltbridge_tree, height = 3, width = 6)
Destabilizing_hbond_saltbridge_tree
ddata_labels2 <- data.frame("position" = ddata_labels$label, "group" = ddata_labels$color_group)
ddata_labels2 <- merge(ddata_labels2, pten_positions[,c("position","cosmic_count")], by = "position", all.x = TRUE)
ddata_labels2 <- ddata_labels2[order(-ddata_labels2$cosmic_count),]
ddata_labels2[is.na(ddata_labels2)] <- 0
ddata_labels2 <- ddata_labels2[order(ddata_labels2$group),]
PTEN_destabilizing_hbond_cosmic_counts <- ggplot() + geom_point(data = ddata_labels2, aes(x = group, y = cosmic_count), alpha = 0.5) +
geom_text(data = subset(ddata_labels2, cosmic_count > 2), aes(x = group, y = cosmic_count), label = subset(ddata_labels2, cosmic_count > 2)[,"position"], hjust = 0, position = position_nudge(x = 0.1)) + ylab("Frequency of mutation in COSIMC") + xlab("Cluster group of mutation intolerant positions\nlikely involved in polar contacts") +
scale_x_continuous(breaks = seq(1,8)) + scale_y_continuous(breaks = seq(0,20,2))
ggsave(file = "Output/PTEN_destabilizing_hbond_cosmic_counts.pdf", PTEN_destabilizing_hbond_cosmic_counts, height = 2.5, width = 3)
PTEN_destabilizing_hbond_cosmic_counts
PTEN_all_cosmic_counts <- ggplot() + scale_x_continuous(limit = c(0,35)) +
geom_histogram(data = pten_positions, aes(x = cosmic_count), alpha = 0.5, binwidth = 1) +
geom_text(data = subset(pten_positions, cosmic_count > 7 & (saltbridge_unique_partners >= 1 | schbond_unique_partners >= 1) & abundance_class == "intolerant" & ! position %in% c(252,277)), aes(x = cosmic_count, y = 15, label = position, angle = 45), color = "red") +
geom_text(data = subset(pten_positions, cosmic_count > 7 & (saltbridge_unique_partners >= 1 | schbond_unique_partners >= 1) & abundance_class == "intolerant" & position %in% c(252,277)), aes(x = cosmic_count, y = 25, label = position, angle = 45), color = "red") +
geom_point(data = subset(pten_positions, cosmic_count > 7 & (saltbridge_unique_partners >= 1 | schbond_unique_partners >= 1) & abundance_class == "intolerant"), aes(x = cosmic_count, y = 10), color = "red", size = 0.5) +
xlab("Number of mutations in COSMIC") + ylab("Number of residues") +
annotate("text", x = 32, y = 30, label = "Not shown:\nPosition 173: 93\nPosition 130: 355", angle = 90, hjust = 0)
ggsave(file = "Output/PTEN_cosmic_counts.pdf", PTEN_all_cosmic_counts, height = 3, width = 3.5)
PTEN_all_cosmic_counts
## [1] "The number of mutations in COSMIC for position 24 and 26 are 26, 3"
## [1] "The number of mutations in COSMIC for positions 66, 68, 107, 111, 115 are 8, 23, 12, 3, 1"
## [1] "The number of mutations in COSMIC for position 153 and 172 are 4, 1"
## [1] "The number of mutations in COSMIC for position 170 and 329 are 16, NA"
## [1] "The number of mutations in COSMIC for positions 177, 252, 276, 277 are 10, 15, 2, 8"
## [1] "The number of mutations in COSMIC for position 179 and 184 are 1, 1"
## [1] "The number of mutations in COSMIC for positions 254, 256, and 269 are 1, 3, NA"
## [1] "The number of mutations in COSMIC for positions 322, 331, and 334 are 1, 4, 1"
This chunk of code makes a .pml file for displaying these potentially critical h-bonds and salt bridges in Pymol, using the same grouping and color scheme used in the above plots.
Pymol_hbonds_saltbridges_vector <- c()
Pymol_hbonds_saltbridges_vector <- append(Pymol_hbonds_saltbridges_vector, c(paste("create sc_hbonds_unstable, resi ",gsub(", ", "+",toString(subset(pten_positions, schbond_unique_partners >= 1 & abundance_class == "intolerant")$position))," and not name c+n+o", sep = ""), "show sticks, sc_hbonds_unstable", "show surface, sc_hbonds_unstable", "set transparency, 0.2, sc_hbonds_unstable"))
Pymol_hbonds_saltbridges_vector <- append(Pymol_hbonds_saltbridges_vector, c(paste("create saltbridge_unstable, resi ",gsub(", ","+",toString(subset(pten_positions, saltbridge_unique_partners >= 1 & abundance_class == "intolerant")$position))," and not name c+n+o", sep = ""), "show sticks, saltbridge_unstable", "show surface, saltbridge_unstable", "set transparency, 0.2, saltbridge_unstable"))
group_color_vector <- c("salmon","brown","limegreen","violet","orange","yellow","hotpink","forest")
color_group_commands <- c()
for(x in 1:max(ddata_labels$color_group)){
color_group_commands <- append(color_group_commands, paste("select group", x, ", (sc_hbonds_unstable or saltbridge_unstable) and resi ",gsub(", ","+",toString(ddata_labels[ddata_labels$color_group == x,"label"])), " and not name c+n+o; color ",group_color_vector[x],", group", x, sep=""))
}
fileConn<-file("Output/Pymol_2_PTEN_Hbonds_saltbridges.pml")
writeLines(c(pymol_setup,pymol_color_setup_vector,pymol_color_score_vector,Pymol_hbonds_saltbridges_vector,pymol_pten_standard_view,color_group_commands
,paste("png ",markdown_directory,"/Output/Pymol_2_PTEN_Hbonds_saltbridges.png, width=10cm, dpi=300, ray=1",sep="")
), fileConn)
close(fileConn)
We identified positions in PTEN that had mean abundance scores higher than WT. Some of these enhanced-abundance positions were in structurally resolved regions, and many of them were within 7 Å of known phospholipid-binding positions. In comparison, far fewer of all structurally resolved PTEN positions were within 7 Å of phospholipid-binding positions (Supplementary Fig. 4e). Thus, positions with abundance-enhancing variants tended to be near the membrane-proximal face of PTEN, and included those important for binding PIP3, PIP2 or PI(3)P.
pten_high_score_positions <- subset(pten_positions, abundance_class == "enhanced")
paste("There are",nrow(pten_high_score_positions),"positions in PTEN that had mean abundance scores higher than WT")
## [1] "There are 41 positions in PTEN that had mean abundance scores higher than WT"
pten_high_score_positions_structure <- subset(pten_high_score_positions, !is.na(xca))
pten_high_score_positions_nostructure <- subset(pten_high_score_positions, is.na(xca))
pten_membrane_positions <- c(14,47,130,seq(260,268))
pten_high_score_positions_temp_matrix <- data.frame(matrix(NA, nrow = nrow(pten_high_score_positions_structure), ncol = length(pten_membrane_positions)))
for(row_counter in 1:nrow(pten_high_score_positions_structure)){
for(col_counter in 1:length(pten_membrane_positions)){
#print(paste(row_counter,col_counter))
pten_high_score_positions_temp_matrix[row_counter,col_counter] <- sqrt((subset(pten_positions, position == pten_membrane_positions[col_counter])$xca - pten_high_score_positions_structure[row_counter,"xca"])^2 +
(subset(pten_positions, position == pten_membrane_positions[col_counter])$yca - pten_high_score_positions_structure[row_counter,"yca"])^2 +
(subset(pten_positions, position == pten_membrane_positions[col_counter])$zca - pten_high_score_positions_structure[row_counter,"zca"])^2)
}
}
pten_high_score_positions_temp_matrix$row_min <- apply(pten_high_score_positions_temp_matrix,1,min)
## This is for comparison
pten_positions_structure <- subset(pten_positions, !is.na(xca))
pten_positions_temp_matrix <- data.frame(matrix(NA, nrow = nrow(pten_positions_structure), ncol = length(pten_membrane_positions)))
for(row_counter in 1:nrow(pten_positions_structure)){
for(col_counter in 1:length(pten_membrane_positions)){
#print(paste(row_counter,col_counter))
pten_positions_temp_matrix[row_counter,col_counter] <- sqrt((subset(pten_positions, position == pten_membrane_positions[col_counter])$xca - pten_positions_structure[row_counter,"xca"])^2 +
(subset(pten_positions, position == pten_membrane_positions[col_counter])$yca - pten_positions_structure[row_counter,"yca"])^2 +
(subset(pten_positions, position == pten_membrane_positions[col_counter])$zca - pten_positions_structure[row_counter,"zca"])^2)
}
}
pten_positions_temp_matrix$row_min <- apply(pten_positions_temp_matrix,1,min)
## Plot the two minimum distances
High_score_position_distances_plot <- ggplot() + geom_histogram(data = pten_positions_temp_matrix, aes(x = row_min), alpha = 0.5, binwidth = 1) +
geom_histogram(data = pten_high_score_positions_temp_matrix, aes(x = row_min), fill = "red", color = "red", alpha = 0.5, binwidth = 1) + scale_x_continuous(breaks = c(0,5,10,20,30)) + geom_vline(xintercept = 7)
ggsave(file = "Output/High_score_position_distances_plot.pdf", High_score_position_distances_plot, height = 2.3, width = 3)
High_score_position_distances_plot
paste(round(sum(pten_high_score_positions_temp_matrix$row_min < 7) / length(pten_high_score_positions_temp_matrix$row_min),3)*100, "percent of high score positions were within 6.5 angstroms of a known active site or membrane localization position")
## [1] "57.9 percent of high score positions were within 6.5 angstroms of a known active site or membrane localization position"
paste(round(sum(pten_positions_temp_matrix$row_min < 7) / length(pten_positions_temp_matrix$row_min),3)*100, "percent of all PTEN positions were within 6.5 angstroms of a known active site or membrane localization position")
## [1] "13 percent of all PTEN positions were within 6.5 angstroms of a known active site or membrane localization position"
# Make a new vector for reference positions
pymol_ref_positions <- c()
# Make a command that creates a new object of only the 20 % most destabilization-prone positions
pymol_ref_positions <- append(pymol_ref_positions, paste("create ref_positions, pten and resi ",gsub(", ","+",toString(pten_membrane_positions)),"", sep= ""))
# Make the reference positions a sphere represntation
pymol_ref_positions <- append(pymol_ref_positions, c("show spheres, ref_positions","set sphere_scale, 0.75, ref_positions","color cyan, ref_positions","orient pten"))
# Make a new vector for the 20% lowest positional scores
pymol_highest_scores <- c()
# Make a command that creates a new object of only the 20 % most destabilization-prone positions
pymol_highest_scores <- append(pymol_highest_scores, paste("create higher_than_wt, pten and resi ",gsub(", ","+",toString(subset(pten_positions, abundance_class == "enhanced")$position)),"", sep= ""))
# Make the destabilized positions a semi-transparent surface representation
pymol_highest_scores <- append(pymol_highest_scores, c("show spheres, higher_than_wt","set sphere_transparency = 0.5, higher_than_wt","color red, higher_than_wt", "orient pten"))
fileConn<-file("Output/Pymol_1B_PTEN_highest_scores.pml")
writeLines(c(pymol_setup,pymol_color_setup_vector,pymol_color_score_vector,pymol_ref_positions,pymol_highest_scores,pymol_pten_standard_view,
paste("png ",markdown_directory,"/Output/Pymol_1B_PTEN_highest_scores.png, width=10cm, dpi=300, ray=1",sep="")
), fileConn)
close(fileConn)
Phosphorylation at various sites of the unstructured C-terminal tail of PTEN is known to be involved in negatively regulating its activity. We observed that phosphomimetic substitutions at the S385 PTEN C-terminal regulatory phosphosite exhibited the highest abundance scores, whereas positively charged substitutions had low scores, supporting the impact of phosphorylation at this site on abundance.
pten_phosphorylation_sites <- c(385) ## residues 229,321,336 were outside this region c(366,370,380,382,383,385)
test <- pten_variants[pten_variants$position %in% pten_phosphorylation_sites,]
nonphospho_mimetic <- subset(test, (end %in% c("K","R")))[,c("variant","start","end","position","score")]
nonphospho_mimetic$color <- "blue"
phospho_mimetic <- subset(test, end %in% c("E","D"))[,c("variant","start","end","position","score")]
phospho_mimetic$color <- "red"
phospho_frame <- rbind(nonphospho_mimetic, phospho_mimetic)
orig_phosphosite <- subset(pten_variants, (position %in% pten_phosphorylation_sites) & class == "synonymous")[,c("variant","start","end","position","score")]
orig_phosphosite$color <- "black"
all_frame <- subset(pten_variants, (position %in% pten_phosphorylation_sites) & (end %in% c("A","G","M","N","Q","C")) & (class == "missense"))[,c("variant","start","end","position","score")]
all_frame$label <- factor(paste(all_frame$start,all_frame$position,sep=""), levels = c("S385")) #"S229","T321","Y336","T366","S370","S380","T382","T383",
phosphosite_plot <- ggplot() + geom_point(data = all_frame, aes(x = label, y = score), color = "grey50") +
geom_point(data = orig_phosphosite, aes(x = factor(paste(orig_phosphosite$start,orig_phosphosite$position,sep="")), y = score, color = as.character(orig_phosphosite$color)), color = as.character(orig_phosphosite$color)) +
geom_point(data = phospho_frame, aes(x = factor(paste(phospho_frame$start,phospho_frame$position,sep="")), y = score, color = as.character(phospho_frame$color)), color = as.character(phospho_frame$color)) +
geom_text(data = phospho_frame, aes(x = factor(paste(phospho_frame$start,phospho_frame$position,sep="")), y = score, label = end), size = 3, position = position_nudge(x = 0.3)) +
theme(panel.grid.major.x = element_blank(), panel.grid.major.y = element_line( size=.1, color="black" )) +
xlab("Phosphorylation\nsite") + ylab("Abundance score")
ggsave(file = "Output/PTEN_Phosphosite_plot.pdf", phosphosite_plot, height = 2, width = 1.6)
phosphosite_plot
First, show positional median representatives in Pymol.
Making a colorscale for coloring my Pymol.
Making a surface representation of the 20% positions with the lowest and highest scores, as well as positions with a high percentage of highly stabilizing tpmt_variants.
# Make a new vector for the 20% lowest positional scores
pymol_tpmt_lowest_scores <- c()
# Make a command that creates a new object of only the 25% most destabilization-prone positions
pymol_tpmt_lowest_scores <- append(pymol_tpmt_lowest_scores, paste("create lowest_20, tpmt and resi ",gsub(", ","+",toString(subset(tpmt_positions, score < quantile(tpmt_positions$score, 0.2, na.rm = TRUE))$position)),"", sep= ""))
# Make the destabilized positions a black surface representation as well
pymol_tpmt_lowest_scores <- append(pymol_tpmt_lowest_scores, c("show surface, lowest_20","set transparency, 0.5, lowest_20","orient tpmt"))
# Do the same thing for the 20% highest positional scores
pymol_tpmt_highest_scores <- c()
# Make a command that creates a new object of only the 25% most destabilization-prone positions
pymol_tpmt_highest_scores <- append(pymol_tpmt_highest_scores, paste("create highest_20, tpmt and resi ",gsub(", ","+",toString(subset(tpmt_positions, score > quantile(tpmt_positions$score, 0.8, na.rm = TRUE))$position)),"", sep= ""))
# Make the destabilized positions a black surface representation as well
pymol_tpmt_highest_scores <- append(pymol_tpmt_highest_scores, c("show surface, highest_20","set transparency, 0.5, highest_20","orient tpmt"))
# Find positions with very high single variant scores, which will be shown as spheres
# Only select for missense tpmt_variants that are higher than the top 10% synonymous tpmt_variants
# Many are involved in lipid binding (2015, Wei, Roberts, JBiolChem)
variant_tpmt_high_scores <- subset(tpmt_variants, class == "missense" & score > quantile(subset(tpmt_variants, class == "synonymous")$score, 0.90))
row.names(variant_tpmt_high_scores) <- NULL
tpmt_positions_stable <- data.frame(table(subset(variant_tpmt_high_scores, class == "missense")$position))
tpmt_positions_stable <- tpmt_positions_stable[order(tpmt_positions_stable$Freq,decreasing = TRUE),]
row.names(tpmt_positions_stable) <- NULL
colnames(tpmt_positions_stable) <- c("position","count_10percent")
tpmt_positions_all <- data.frame(table(subset(tpmt_variants, class == "missense")$position))
colnames(tpmt_positions_all) <- c("position","count_all")
tpmt_positional_stable_frequency <- merge(tpmt_positions_stable, tpmt_positions_all, by = "position", all.x = TRUE)
tpmt_positional_stable_frequency$stable_frequency <- tpmt_positional_stable_frequency$count_10percent / tpmt_positional_stable_frequency$count_all
tpmt_positional_stable_frequency <- subset(tpmt_positional_stable_frequency, count_all >= 5)
tpmt_positional_stable_frequency <- tpmt_positional_stable_frequency[order(-tpmt_positional_stable_frequency$count_10percent),]
tpmt_positional_stable_frequency <- tpmt_positional_stable_frequency[order(-tpmt_positional_stable_frequency$stable_frequency),]
tpmt_positional_stable_frequency <- subset(tpmt_positional_stable_frequency, stable_frequency > quantile(tpmt_positional_stable_frequency$stable_frequency, 1/4))
tpmt_positional_stable_frequency <- tpmt_positional_stable_frequency[order(tpmt_positional_stable_frequency$position),]
pymol_most_stabilizing_variations_vector <- c()
pymol_most_stabilizing_variations_vector <- append(pymol_most_stabilizing_variations_vector, c(paste("select stabilizing_tpmt_variants, highest_20 and resi ", gsub(", ","+",toString(tpmt_positional_stable_frequency$position)),sep=""),"set sphere_scale, 1.25, stabilizing_tpmt_variants","show spheres, stabilizing_tpmt_variants","color red, stabilizing_tpmt_variants","set sphere_transparency=0.2, stabilizing_tpmt_variants"))
fileConn<-file("Output/Pymol_3_TPMT_lowest_highest_scores.pml")
writeLines(c(pymol_tpmt_setup,pymol_tpmt_color_setup_vector,pymol_tpmt_color_score_vector,pymol_tpmt_lowest_scores,pymol_tpmt_standard_view,
paste("png ",markdown_directory,"/Output/Pymol_3_TPMT_lowest_highest_scores.png, width=10cm, dpi=300, ray=1",sep="")
), fileConn)
close(fileConn)
Now let’s see if I can combine both the low-score hbonds and salt bridges to make groupings in 3-dimensional space.
schbond_saltbridge_tpmt_positions <- subset(tpmt_positions, (saltbridge_unique_partners >= 1 & abundance_class == "intolerant") | (schbond_unique_partners >= 1 & abundance_class == "intolerant"))
schbond_saltbridge_positions <- schbond_saltbridge_tpmt_positions[,c("sc_xca","sc_yca","sc_zca")]
rownames(schbond_saltbridge_positions) <- schbond_saltbridge_tpmt_positions$position
schbond_saltbridge_frame <- data.frame(matrix(ncol = nrow(schbond_saltbridge_tpmt_positions), nrow = nrow(schbond_saltbridge_tpmt_positions)))
colnames(schbond_saltbridge_frame) <- schbond_saltbridge_tpmt_positions$position; rownames(schbond_saltbridge_frame) <- schbond_saltbridge_tpmt_positions$position
for(x in 1:nrow(schbond_saltbridge_frame)){
for(y in 1:nrow(schbond_saltbridge_frame)){
schbond_saltbridge_frame[x,y] <- sqrt((abs(schbond_saltbridge_tpmt_positions$sc_xca[x] - schbond_saltbridge_tpmt_positions$sc_xca[y]))^2 +
(abs(schbond_saltbridge_tpmt_positions$sc_yca[x] - schbond_saltbridge_tpmt_positions$sc_yca[y]))^2 +
(abs(schbond_saltbridge_tpmt_positions$sc_zca[x] - schbond_saltbridge_tpmt_positions$sc_zca[y]))^2)
}
}
schbond_saltbridge_frame2 <- dist(schbond_saltbridge_positions, method = "euclidean")
fit <- hclust(schbond_saltbridge_frame2, method="ward.D2")
#ggdendrogram(fit, rotate = FALSE, size = 2)
dhc <- as.dendrogram(fit)
ddata <- dendro_data(dhc, type = "rectangle")
schbond_saltbridge_frame2_cut <- cutree(fit, h = 10)
schbond_saltbridge_frame2_cut <- data.frame(schbond_saltbridge_frame2_cut)
schbond_saltbridge_frame2_cut$label <- rownames(schbond_saltbridge_frame2_cut)
colnames(schbond_saltbridge_frame2_cut) <- c("group","label")
ddata_labels <- data.frame(ddata$labels)
ddata_labels <- merge(ddata_labels, schbond_saltbridge_frame2_cut, by = "label")
ddata_labels$group <- as.factor(ddata_labels$group)
Destabilizing_hbond_saltbridge_tree <- ggplot(segment(ddata)) +
geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_text(data = ddata_labels, aes(x = x, y = y-0.2, label = label, vjust = 1, color = group)) +
geom_hline(yintercept = 10, linetype = 2) + xlab(NULL) + ylab(NULL) + scale_x_continuous(breaks = NULL) + theme(axis.text.x = element_blank())
ggsave(file = "Output/TPMT_Destabilizing_hbond_saltbridge_tree.pdf", Destabilizing_hbond_saltbridge_tree, height = 2.5, width = 5)
Destabilizing_hbond_saltbridge_tree
mergy <- merge(schbond_saltbridge_frame, schbond_saltbridge_frame2_cut, by = "row.names")
group1 <- subset(mergy, group == 1)
Finding residues invoved in Hbonds or Salt Bridges with really low scores.
pymol_tpmt_hbonds_saltbridges_vector <- c()
pymol_tpmt_hbonds_saltbridges_vector <- append(pymol_tpmt_hbonds_saltbridges_vector, c(paste("create sc_hbonds_unstable, resi ",gsub(", ", "+",toString(subset(tpmt_positions, schbond_unique_partners >= 1 & abundance_class == "low")$position))," and not name c+n+o", sep = ""), "show sticks, sc_hbonds_unstable", "show surface, sc_hbonds_unstable", "set transparency, 0.2, sc_hbonds_unstable"))
pymol_tpmt_hbonds_saltbridges_vector <- append(pymol_tpmt_hbonds_saltbridges_vector, c(paste("create saltbridge_unstable, resi ",gsub(", ","+",toString(subset(tpmt_positions, saltbridge_unique_partners >= 1 & abundance_class == "low")$position))," and not name c+n+o", sep = ""), "show sticks, saltbridge_unstable", "show surface, saltbridge_unstable", "set transparency, 0.2, saltbridge_unstable"))
group_color_vector <- c("tv_red","slate","sand","limegreen","violet","orange","yellow")
pymol_tpmt_color_group_commands <- c()
for(x in 1:max(schbond_saltbridge_frame2_cut$group)){
pymol_tpmt_color_group_commands <- append(pymol_tpmt_color_group_commands, paste("select group", x, ", (sc_hbonds_unstable or saltbridge_unstable) and resi ",gsub(", ","+",toString(subset(schbond_saltbridge_frame2_cut, group == x)$label)), " and not name c+n+o; color ",group_color_vector[x],", group", x, sep=""))
}
tpmt_hbonds_fileConn<-file("Output/Pymol_4_TPMT_Hbonds_saltbridges.pml")
writeLines(c(pymol_tpmt_setup,pymol_tpmt_color_setup_vector,pymol_tpmt_color_score_vector,pymol_tpmt_hbonds_saltbridges_vector,pymol_tpmt_standard_view,pymol_tpmt_color_group_commands,
paste("png ",markdown_directory,"/Output/Pymol_4_TPMT_Hbonds_saltbridges.png, width=10cm, dpi=300, ray=1",sep="")), tpmt_hbonds_fileConn)
close(tpmt_hbonds_fileConn)
custom_colorscale <- c("low" = "#3366ff","possibly_low" = "#b3d9ff","possibly_wt-like" = "#ffcccc","wt-like" = "#F8766D","dominant_negative" = "yellow","high" = "brown","unknown" = "grey75")
pten_variants[is.na(pten_variants$abundance_class),"abundance_class"] <- "unknown"
This is a plot showing some representative variants from each abundance class.
errorbar_example_dataframe <- data.frame(c(subset(pten_variants, abundance_class == "low")[1,c("score","lower_ci","upper_ci","variant","expts")]))
errorbar_example_dataframe$variant <- as.character(errorbar_example_dataframe$variant)
errorbar_example_dataframe <- rbind(errorbar_example_dataframe, c(subset(pten_variants, abundance_class == "possibly_low")[5,c("score","lower_ci","upper_ci","variant","expts")]))
errorbar_example_dataframe <- rbind(errorbar_example_dataframe, c(subset(pten_variants, abundance_class == "possibly_wt-like")[1,c("score","lower_ci","upper_ci","variant","expts")]))
errorbar_example_dataframe <- rbind(errorbar_example_dataframe, c(subset(pten_variants, abundance_class == "wt-like")[2,c("score","lower_ci","upper_ci","variant","expts")]))
rownames(errorbar_example_dataframe) <- c("low","possibly low","possibly wt-like","wt-like")
classification_example_plot <- ggplot() +
geom_density(data = subset(pten_variants, class == "synonymous"), aes(x = score, y = ..scaled..), fill = "red", alpha = 0.5) +
geom_point(data = errorbar_example_dataframe, aes(x = as.numeric(score), y = c(1.5,2,2.5,3))) +
geom_errorbarh(aes(y = c(1.5,2,2.5,3), x = as.numeric(errorbar_example_dataframe$lower_ci), xmin = as.numeric(errorbar_example_dataframe$lower_ci), xmax = as.numeric(errorbar_example_dataframe$upper_ci)), linetype = 1) +
annotate("text", x = errorbar_example_dataframe$score, y = c(1.5,2,2.5,3)+0.15, label = errorbar_example_dataframe$variant) +
annotate("text", x = 1, y = 1.2, label = "WT-synonyms", color = "red") +
annotate("text", x = 0.02, y = c(1.5,2,2.5,3), label = c("Low","Possibly Low","Possibly WT-like","WT-like"), hjust = 0) +
xlab("PTEN abundance score") + theme(panel.grid.major.y = element_blank()) + geom_vline(xintercept = quantile(subset(pten_variants, class == "synonymous")$score,0.05), linetype = 2) + ylab("WT synonym density") +
scale_y_continuous(breaks = c(0,0.5,1))
ggsave(file = "Output/PTEN_classification_example_plot.pdf", classification_example_plot, height = 3, width = 5)
print(classification_example_plot)
## [1] "Total missense variants possible by SNV: 2391"
## [1] "Total missense variants possible by SNV with a score in our experiment: 1366"
## [1] "Total (germline) missense PTEN variants in Clinvar: 216"
## [1] "Total (germline) missense PTEN variants in Clinvar also found in our experiment: 131"
## [1] "Total (germline) missense PTEN variants in Clinvar also found in our experiment with low abundance: 0.24"
## [1] "Of Clinvar annotated, the number of known pathogenic variants: 41"
## [1] "Of known pathogenic, the number observed in our experiment: 25"
## [1] "Of known pathogenic in our experiment, the number that were low abundance: 16"
## [1] "The fraction is then: 0.64"
## [1] "Of known pathogenic in our experiment, the number that were possibly low abundance: 3"
## [1] "Of known pathogenic in our experiment, the remaining variants not of low abundance or possibly low abundance: D24G, H93R, G129E, R130L, T131I, R234Q"
## [1] "Of Clinvar annotated, the number of likely pathogenic variants: 41"
## [1] "Of likely pathogenic, the number observed in our experiment: 23"
## [1] "Of likely pathogenic in our experiment, the number that were low abundance: 10"
## [1] "The fraction is then: 0.43"
## [1] "Of Clinvar annotated, the number of VUS: 134"
## [1] "Of VUS, the number observed in our experiment: 83"
## [1] "Of VUS in our experiment, the number that were low abundance: 22"
## [1] "The fraction is then: 0.27"
## [1] "Total missense variants possible by SNV and not found in Clinvar: 2175"
## [1] "Total missense variants possible by SNV and not found in Clinvar, that we now provide scores for: 1235"
## [1] "Total missense variants possible by SNV and not found in Clinvar, that we find to be low abundance: 275"
## [1] "Total missense variants possible by SNV and not listed as pathogenic, that we find to be low abundance: 307"
## [1] "Total missense variants possible by SNV, that we find to be low abundance: 1366"
## [1] "Total missense variants possible by SNV, that we find to be low abundance: 323"
## [1] "Percent of missense variants possible by SNV that we find to be low abundance: 0.24"
Look at the distribution of stability scores across pathogenic and uncertain.
benign_frame <- subset(pten_variants, variant %in% c("A79T","S294R","P354Q"))
possible_snv_plot <- ggplot() + scale_x_continuous(limits = c(min(pten_variants$score, na.rm = TRUE), max(pten_variants$score, na.rm = TRUE))) + xlab(NULL) + ylab("All possible SNV") + scale_y_continuous(limits = c(0,320), breaks = seq(0,200,50), expand = c(0,0.1)) +
geom_histogram(data = subset(pten_variants, snv == 1), aes(x = score, fill = factor(abundance_class, levels = c("high","R130G","wt-like","possibly_wt-like","possibly_low","low"))), binwidth = 0.1, alpha = 0.5, position="stack", color = "black") +
geom_point(data = benign_frame, aes(x = score, y = 225, color = abundance_class), size = 1) + annotate("text", x = benign_frame$score, y = 240, label = benign_frame$variant, angle = 45, size = 2.5, color = "black", hjust = 0) +
geom_vline(xintercept = pten_synonymous_5th, linetype = 2) + scale_fill_manual(values = custom_colorscale) + scale_color_manual(values = custom_colorscale) + theme(legend.position="none", panel.grid.major = element_line(colour="grey90"))
ggsave("Output/Classification_all_snv.pdf",possible_snv_plot,height = 1, width = 2)
possible_snv_plot
clinvar_pathog_plot <- ggplot() + scale_x_continuous(limits = c(min(pten_variants$score, na.rm = TRUE), max(pten_variants$score, na.rm = TRUE))) + xlab(NULL) + ylab("Pathogenic") + scale_y_continuous(breaks = seq(0,10,1), expand = c(0,0.1)) +
geom_histogram(data = subset(pten_variants, clinvar_pathog == 1 & (class == "missense")), aes(x = score, fill = factor(abundance_class, levels = c("high","R130G","wt-like","possibly_wt-like","possibly_low","low"))), binwidth = 0.1, alpha = 0.5, position="stack", color = "black") +
geom_vline(xintercept = pten_synonymous_5th, linetype = 2) + scale_fill_manual(values = custom_colorscale) + theme(legend.position="none", panel.grid.major = element_line(colour="grey90"))
ggsave("Output/Classification_Clinvar_pathogenic.pdf",clinvar_pathog_plot,height = 1, width = 2)
clinvar_pathog_plot
clinvar_likely_pathog_plot <- ggplot() + scale_x_continuous(limits = c(min(pten_variants$score, na.rm = TRUE), max(pten_variants$score, na.rm = TRUE))) + xlab(NULL) + ylab("Likely Pathogenic") + scale_y_continuous(breaks = seq(0,10,1), expand = c(0,0.1)) +
geom_histogram(data = subset(pten_variants, clinvar_likely_pathog == 1 & (class == "missense")), aes(x = score, fill = factor(abundance_class, levels = c("high","R130G","wt-like","possibly_wt-like","possibly_low","low"))), binwidth = 0.1, alpha = 0.5, position="stack", color = "black") +
geom_vline(xintercept = pten_synonymous_5th, linetype = 2) + scale_fill_manual(values = custom_colorscale) + theme(legend.position="none", panel.grid.major = element_line(colour="grey90"))
ggsave("Output/Classification_Clinvar_likely_pathogenic.pdf",clinvar_likely_pathog_plot,height = 1, width = 2)
clinvar_likely_pathog_plot
clinvar_vus_plot <- ggplot() + scale_x_continuous(limits = c(min(pten_variants$score, na.rm = TRUE), max(pten_variants$score, na.rm = TRUE))) + xlab(NULL) + ylab("VUS") + scale_y_continuous(breaks = seq(0,20,5), expand = c(0,0.1)) +
geom_histogram(data = subset(pten_variants, clinvar_uncertain == 1 & (class == "missense")), aes(x = score, fill = factor(abundance_class, levels = c("high","R130G","wt-like","possibly_wt-like","possibly_low","low"))), binwidth = 0.1, alpha = 0.5, position="stack", color = "black") +
geom_vline(xintercept = pten_synonymous_5th, linetype = 2) + scale_fill_manual(values = custom_colorscale) + theme(legend.position="none", panel.grid.major = element_line(colour="grey90"))
ggsave("Output/Classification_Clinvar_VUS.pdf",clinvar_vus_plot,height = 1, width = 2)
clinvar_vus_plot
pten_variants$above_gnomad_allele_count_threshold_for_cowdens <- NA
collective_pten_cowdens_allele_frequency <- 1/400000
for(x in 1:nrow(pten_variants)){
if(!is.na(pten_variants$gnomad_allele_count[x])){
if(pten_variants$gnomad_allele_count[x] > qbinom(0.99, size = pten_variants$gnomad_allele_number[x], collective_pten_cowdens_allele_frequency / 0.95)){
pten_variants$above_gnomad_allele_count_threshold_for_cowdens[x] <- "above threshold"
if(pten_variants$gnomad_allele_count[x] <= qbinom(0.99, size = pten_variants$gnomad_allele_number[x], collective_pten_cowdens_allele_frequency / 0.95)){
pten_variants$above_gnomad_allele_count_threshold_for_cowdens[x] <- "below threshold"
}
}
}
}
subset(pten_variants, above_gnomad_allele_count_threshold_for_cowdens == "above threshold")[,c("variant","score","abundance_class")]
## variant score abundance_class
## 1592 A79T 0.9981539 wt-like
## 3560 Y176C NA unknown
## 4162 M205V 0.7013760 possibly_low
## 5973 S294R 0.9242193 wt-like
## 6044 Q298E 0.7729313 possibly_wt-like
## 7192 P354Q 1.1960480 wt-like
## 7654 Y377F 0.8040139 possibly_wt-like
Likely_benign_plot <- ggplot() + scale_x_continuous(limits = c(min(pten_variants$score, na.rm = TRUE), max(pten_variants$score, na.rm = TRUE))) + xlab(NULL) + ylab("Likely to be benign") + scale_y_continuous(breaks = c(0,1,2), expand = c(0,0.1)) +
geom_histogram(data = subset(pten_variants, above_gnomad_allele_count_threshold_for_cowdens == "above threshold"), aes(x = score, fill = factor(abundance_class, levels = c("high","R130G","wt-like","possibly_wt-like","possibly_low","low"))), binwidth = 0.1, alpha = 0.5, position="stack", color = "black") +
geom_vline(xintercept = pten_synonymous_5th, linetype = 2) + scale_fill_manual(values = custom_colorscale) + theme(legend.position="none", panel.grid.major = element_line(colour="grey90"))
ggsave("Output/Likely_benign_plot.pdf",Likely_benign_plot,height = 1.5, width = 3)
Likely_benign_plot
Making bar charts to show pathogenicity distributions across cancers.
clinvar_list <- c("snv","clinvar_uncertain","clinvar_likely_pathog","clinvar_pathog")
clinvar_table <- data.frame("clinvar_class" = matrix(c("low","possibly_low","possibly_wt-like","wt-like"), nrow = 4, ncol = 1))
for(x in clinvar_list){
clinvar_table$snv <- table(subset(pten_variants, snv == 1 & class == "missense" & !(abundance_class %in% c("unknown","dominant_negative")))[,"abundance_class"]) / nrow(subset(pten_variants, snv == 1 & class == "missense" & !is.na(score)))
clinvar_table$clinvar_uncertain <- table(subset(pten_variants, clinvar_uncertain == 1 & class == "missense" & !(abundance_class %in% c("unknown","dominant_negative")))[,"abundance_class"]) / nrow(subset(pten_variants, clinvar_uncertain == 1 & class == "missense" & !is.na(score)))
clinvar_table$clinvar_likely_pathog <- table(subset(pten_variants, clinvar_likely_pathog == 1 & class == "missense" & !(abundance_class %in% c("unknown","dominant_negative")))[,"abundance_class"]) / nrow(subset(pten_variants, clinvar_likely_pathog == 1 & class == "missense" & !is.na(score)))
clinvar_table$clinvar_pathog <- table(subset(pten_variants, clinvar_pathog == 1 & class == "missense" & !(abundance_class %in% c("unknown","dominant_negative")))[,"abundance_class"]) / nrow(subset(pten_variants, clinvar_pathog == 1 & class == "missense" & !is.na(score)))
}
clinvar_table_melt <- melt(clinvar_table, id = "clinvar_class")
clinvar_table_melt$clinvar_class <- factor(clinvar_table_melt$clinvar_class, levels = c("wt-like","possibly_wt-like","possibly_low","low"))
clinvar_table_melt <- clinvar_table_melt[,c(1,2,4)]
colnames(clinvar_table_melt) <- c("abundance_class","clinvar_class","count")
pten_clinvar_bar_graphs <- ggplot() + geom_bar(data = clinvar_table_melt, aes(x = clinvar_class, y = count, fill = abundance_class), stat = "identity", position = "fill", color = "black") + coord_flip() + scale_fill_manual(values = custom_colorscale) + xlab(NULL) + ylab(NULL) + theme(legend.position = "none")
ggsave("Output/PTEN_clinvar_bar_graphs.pdf",pten_clinvar_bar_graphs,height = 2, width = 4)
pten_clinvar_bar_graphs
Performing a sampling test to see if the ratio of low abundance variants is statistically significant.
## This is the distribution from which I'll sample from
snv_subset <- pten_variants[pten_variants$snv == 1 & !is.na(pten_variants$score),c("variant","score","abundance_class")]
## This is the table I'll use to keep track of my results
clinvar_observed_expected_table <- data.frame("abundance_class" = matrix(c("low"), nrow = 1, ncol = 1))
## This is the three different groups I'll test
clinvar_list <- c("clinvar_pathog","clinvar_likely_pathog","clinvar_uncertain")
for(x in clinvar_list){
clinvar_subset <- pten_variants[pten_variants[,paste(x,sep="")] != 0 & !is.na(pten_variants$score),c("variant","score","abundance_class",paste(x,sep=""))]
sample_number <- nrow(clinvar_subset)
low_class_vector <- c()
total_iterations <- 10000
for(y in 1:total_iterations){
sample_iteration <- sample(snv_subset$abundance_class, size = sample_number, replace = TRUE)
low_class_vector <- append(low_class_vector, sum(sample_iteration == "low")/sample_number)
}
low_class_frame <- data.frame("percentage_low" = low_class_vector)
clinvar_observed_low_freq <- sum(clinvar_subset$abundance_class == "low")/nrow(clinvar_subset)
low_p_value <- sum(clinvar_observed_low_freq < low_class_frame$percentage_low) / total_iterations
p_value_vector <- c(low_p_value)
p_value_vector[p_value_vector == 0] <- 1/total_iterations
clinvar_observed_expected_table <- cbind(clinvar_observed_expected_table, p_value_vector)
}
colnames(clinvar_observed_expected_table) <- c("abundance_class",clinvar_list)
print(t(clinvar_observed_expected_table))
## [,1]
## abundance_class "low"
## clinvar_pathog "1e-04"
## clinvar_likely_pathog "0.0188"
## clinvar_uncertain "0.3799"
Ascribing potential new ClinVar variant pathogenicity classifications if low abundance is considered PS3.
## PTEN already has PP2 and PP3, thus 2 supporting pathogenic.
## Except for 3 variants, PTEN also has PM2
## Some sites, such as those very close to the active site, an also be considered PM1 (redundant with PM5)
## Depending on whether a pathogenic variant was observed at a residue, can have PM5
## Treating Low-abundance classification as PS3
pten_variants$likely_pathog_new <- 0
for(x in 1:nrow(pten_variants)){
if(!(pten_variants$position[x] %in% unique(subset(pten_variants, clinvar_pathog == 1)$position)) & pten_variants$abundance_class[x] == "low"){
pten_variants$likely_pathog_new[x] <- 1}
}
## Individual example I135k
i135k_reclass <- subset(pten_variants, variant == "I135K")[,c("variant","hgvs","snv","abundance_class","clinvar_likely_pathog","clinvar_uncertain","gnomad_allele_freq","predictor_fraction_deleterious")]
i135k_reclass$acmg_amp_criteria <- "PM2, PM5(I135V;PMID 17392703), PP2, PP3, PS3"
i135k_reclass$acmg_amp_summary <- "1S-2M-2P"
i135k_reclass$potential_reclassification <- "Pathogenic"
# Individual example K66E
k66e_reclass <- subset(pten_variants, variant == "K66E")[,c("variant","hgvs","snv","abundance_class","clinvar_likely_pathog","clinvar_uncertain","gnomad_allele_freq","predictor_fraction_deleterious")]
k66e_reclass$acmg_amp_criteria <- "PM2, PP1 (UW patient data), PP2, PP3, PP4 (UW patient data), PS3"
k66e_reclass$acmg_amp_summary <- "1S-1M-4P"
k66e_reclass$potential_reclassification <- "Pathogenic"
## How many low abundance uncertain or unknown single amino acid variants are there?
paste("Number of low abundance variants that could be considered likely pathogenic if low abundance considered experimental evidence of pathogenicity:",round(nrow(subset(pten_variants, class == "missense" & !is.na(score) & abundance_class == "low" & clinvar_uncertain != 1 & clinvar_likely_pathog != 1 & clinvar_pathog != 1)),2))
## [1] "Number of low abundance variants that could be considered likely pathogenic if low abundance considered experimental evidence of pathogenicity: 1090"
paste("Number of low abundance variants (possible by SNV) that could be considered likely pathogenic if low abundance considered experimental evidence of pathogenicity:",round(nrow(subset(pten_variants, class == "missense" & !is.na(score) & abundance_class == "low" & clinvar_uncertain != 1 & clinvar_likely_pathog != 1 & clinvar_pathog != 1 & snv == 1)),2))
## [1] "Number of low abundance variants (possible by SNV) that could be considered likely pathogenic if low abundance considered experimental evidence of pathogenicity: 275"
## Finding the variants of uncertain significance or unannotated variants I can likely reclassify as likely pathogenic
likely_pathogenic_reclass <- rbind(subset(pten_variants, class == "missense" & clinvar_uncertain == 1 & clinvar_likely_pathog != 1 & clinvar_pathog != 1 & abundance_class == "low" & predictor_fraction_deleterious > 0 & !(variant %in% c("I135K","K66E")))[,c("variant","hgvs","snv","abundance_class","clinvar_likely_pathog","clinvar_uncertain","gnomad_allele_freq","predictor_fraction_deleterious")],
subset(pten_variants, class == "missense" & clinvar_likely_pathog != 1 & clinvar_uncertain != 1 & clinvar_pathog != 1 & abundance_class == "low" & predictor_fraction_deleterious > 0 & !(variant %in% c("I135K","K66E")))[,c("variant","hgvs","snv","abundance_class","clinvar_likely_pathog","clinvar_uncertain","gnomad_allele_freq","predictor_fraction_deleterious")])
likely_pathogenic_reclass$acmg_amp_criteria <- ""
likely_pathogenic_reclass$acmg_amp_summary <- ""
for(x in 1:nrow(likely_pathogenic_reclass)){
if(likely_pathogenic_reclass$predictor_fraction_deleterious[x] >= 0.9){
likely_pathogenic_reclass$acmg_amp_criteria[x] <- "Minimally: PM2, PP2, PP3, PS3"
likely_pathogenic_reclass$acmg_amp_summary[x] <- "Minimally: 1S-1M-2P"}
if(likely_pathogenic_reclass$predictor_fraction_deleterious[x] < 0.9){
likely_pathogenic_reclass$acmg_amp_criteria[x] <- "Minimally: PM2, PP2, PS3"
likely_pathogenic_reclass$acmg_amp_summary[x] <- "Minimally: 1S-1M-1P"}
}
likely_pathogenic_reclass$potential_reclassification <- "Likely pathogenic"
write.table(file = "Output/Supplementary_table_7_pten_reclassifications.tsv", rbind(i135k_reclass, k66e_reclass, likely_pathogenic_reclass), row.names = FALSE, quote = FALSE, sep = "\t")
Seeing the distribution of PTEN abudance classes across cancers in cancer genomics data.
## Making a new class for dominant negatives
pten_variants[pten_variants$variant == "R130G" | pten_variants$variant == "R130Q" | pten_variants$variant == "C124S" | pten_variants$variant == "G129E","abundance_class"] <- "dominant_negative"
## Determine the abundance distribution observed across cancers
cancer_list <- c("colorectal","brain","nsclc","endometrial","uterine","breast","melanoma")
cancer_observed_table <- data.frame("abundance_class" = matrix(c("unknown","dominant_negative","wt-like","possibly_wt-like","possibly_low","low"), nrow = 6, ncol = 1))
for(x in cancer_list){
temp_frame <- data.frame()
temp_subset <- pten_variants[pten_variants[,paste("cancer_",x,"_count",sep="")] != 0 & (!is.na(pten_variants$score) | pten_variants$abundance_class == "dominant_negative" | pten_variants$abundance_class == "unknown"),c("variant","score","abundance_class",paste("cancer_",x,"_count",sep=""))]
for(y in 1:nrow(temp_subset)){
for(z in 1:temp_subset[y,paste("cancer_",x,"_count",sep="")]){
temp_frame <- rbind(temp_frame, temp_subset[y,])}
}
temp_frame$abundance_class <- factor(temp_frame$abundance_class, levels = c("unknown","dominant_negative","wt-like","possibly_wt-like","possibly_low","low"))
assign(x = paste("cancer_",x,"_frame",sep=""), value = temp_frame)
cancer_observed_table <- cbind(cancer_observed_table, data.frame(table(temp_frame[,"abundance_class"])/nrow(temp_frame))$Freq)
}
colnames(cancer_observed_table) <- c("abundance_class",cancer_list)
cancer_observed_table_melt <- melt(cancer_observed_table, id = "abundance_class")
cancer_observed_table_melt$abundance_class <- factor(cancer_observed_table_melt$abundance_class, levels = c("unknown","dominant_negative","wt-like","possibly_wt-like","possibly_low","low"))
## Make expected abundance distribution based on the mutation spectrum (null hypothesis)
cancer_expected_table <- data.frame("abundance_class" = matrix(c("unknown","dominant_negative","wt-like","possibly_wt-like","possibly_low","low"), nrow = 6, ncol = 1))
for(x in cancer_list){
temp_frame <- data.frame("abundance_class" = sample(pten_variants$abundance_class, size = 10000, prob = pten_variants[,paste(x,"_cancer","_expected",sep="")], replace = TRUE))
temp_frame$abundance_class <- factor(temp_frame$abundance_class, levels = c("unknown","dominant_negative","wt-like","possibly_wt-like","possibly_low","low"))
assign(x = paste("cancer_",x,"_frame",sep=""), value = temp_frame)
cancer_expected_table <- cbind(cancer_expected_table, data.frame(table(temp_frame[,"abundance_class"])/nrow(temp_frame))$Freq)
}
colnames(cancer_expected_table) <- c("abundance_class",cancer_list)
cancer_expected_table_melt <- melt(cancer_expected_table, id = "abundance_class")
cancer_expected_table_melt$abundance_class <- factor(cancer_expected_table_melt$abundance_class, levels = c("unknown","dominant_negative","wt-like","possibly_wt-like","possibly_low","low"))
## Making a hybrid chart that shows both observed and expected abundance frequencies across cancers
cancer_observed_table_melt$group <- "Observed"
cancer_observed_table_melt$width <- 1.2
cancer_expected_table_melt$group <- "Expected"
cancer_expected_table_melt$width <- .8
observed_expected_table_melt <- rbind(cancer_observed_table_melt, cancer_expected_table_melt)
observed_expected_table_melt$variable <- as.character(observed_expected_table_melt$variable)
for(x in 1:nrow(observed_expected_table_melt)){
if(observed_expected_table_melt$variable[x] == "colorectal"){observed_expected_table_melt$variable[x] <- paste("Colorectal"," (n = ",sum(subset(pten_variants, !is.na(score))[,"cancer_colorectal_count"]),")",sep="")}
if(observed_expected_table_melt$variable[x] == "brain"){observed_expected_table_melt$variable[x] <- paste("Brain"," (n = ",sum(subset(pten_variants, !is.na(score))[,"cancer_brain_count"]),")",sep="")}
if(observed_expected_table_melt$variable[x] == "nsclc"){observed_expected_table_melt$variable[x] <- paste("NSCLC"," (n = ",sum(subset(pten_variants, !is.na(score))[,"cancer_nsclc_count"]),")",sep="")}
if(observed_expected_table_melt$variable[x] == "endometrial"){observed_expected_table_melt$variable[x] <- paste("Endometrial"," (n = ",sum(subset(pten_variants, !is.na(score))[,"cancer_endometrial_count"]),")",sep="")}
if(observed_expected_table_melt$variable[x] == "uterine"){observed_expected_table_melt$variable[x] <- paste("Uterine"," (n = ",sum(subset(pten_variants, !is.na(score))[,"cancer_uterine_count"]),")",sep="")}
if(observed_expected_table_melt$variable[x] == "breast"){observed_expected_table_melt$variable[x] <- paste("Breast"," (n = ",sum(subset(pten_variants, !is.na(score))[,"cancer_breast_count"]),")",sep="")}
if(observed_expected_table_melt$variable[x] == "melanoma"){observed_expected_table_melt$variable[x] <- paste("Melanoma"," (n = ",sum(subset(pten_variants, !is.na(score))[,"cancer_melanoma_count"]),")",sep="")}
}
cancer_combined_bar_graphs <- ggplot() + geom_bar(data = observed_expected_table_melt, aes(x = group, y = value, fill = abundance_class, width = width), position = "fill", stat = "identity", color = "black") +
coord_flip() + xlab(NULL) + ylab("Proportion of stability score class") + scale_y_continuous(expand = c(0,0.01)) + theme(legend.position = "top") + scale_fill_manual(values = custom_colorscale) +
facet_wrap( ~ variable, nrow = 7, strip.position = "right") + theme(strip.text.y = element_text(angle=0), strip.background = element_rect(fill="white"))
ggsave("Output/Cancer_combined_bar_graphs.pdf",cancer_combined_bar_graphs,height = 4.5, width = 4)
cancer_combined_bar_graphs
cancer_list <- c("colorectal","brain","nsclc","endometrial","uterine","breast","melanoma")
cancer_observed_expected_table <- data.frame("abundance_class" = matrix(c("dominant_negative","low","p38s"), nrow = 3, ncol = 1))
cancer_brain_table <- subset(pten_variants, cancer_brain_count > 0)[,c("variant","cancer_brain_count")]
cancer_uterine_table <- subset(pten_variants, cancer_uterine_count > 0)[,c("variant","cancer_uterine_count")]
cancer_breast_table <- subset(pten_variants, cancer_breast_count > 0)[,c("variant","cancer_breast_count")]
cancer_colorectal_table <- subset(pten_variants, cancer_colorectal_count > 0)[,c("variant","cancer_colorectal_count")]
cancer_nsclc_table <- subset(pten_variants, cancer_nsclc_count > 0)[,c("variant","cancer_nsclc_count")]
cancer_endometrial_table <- subset(pten_variants, cancer_endometrial_count > 0)[,c("variant","cancer_endometrial_count")]
cancer_melanoma_table <- subset(pten_variants, cancer_melanoma_count > 0)[,c("variant","cancer_melanoma_count")]
## Another table specific for the P38S variant
cancer_observed_variant_table <- data.frame("P38S",sum(cancer_colorectal_table[cancer_colorectal_table$variant == "P38S","cancer_colorectal_count"]),sum(cancer_brain_table[cancer_brain_table$variant == "P38S","cancer_brain_count"]),
sum(cancer_nsclc_table[cancer_nsclc_table$variant == "P38S","cancer_nsclc_count"]),sum(cancer_endometrial_table[cancer_endometrial_table$variant == "P38S","cancer_endometrial_count"]),
sum(cancer_uterine_table[cancer_uterine_table$variant == "P38S","cancer_uterine_count"]),sum(cancer_breast_table[cancer_breast_table$variant == "P38S","cancer_breast_count"]),
sum(cancer_melanoma_table[cancer_melanoma_table$variant == "P38S","cancer_melanoma_count"]))
colnames(cancer_observed_variant_table) <- c("variant",cancer_list)
## How many times to try resampling
total_iterations <- 10000
## Resampling test for the known dominant negative and low abundance classes.
for(x in cancer_list){
cancer_subset <- pten_variants[pten_variants[,paste(x,"_cancer","_expected",sep="")] != 0,c("variant","score","abundance_class",paste(x,"_cancer","_expected",sep=""))]
sample_number <- sum(pten_variants[,paste("cancer_",x,"_count",sep="")])
dominant_negative_class_vector <- c()
low_class_vector <- c()
for(y in 1:total_iterations){
sample_iteration <- sample(cancer_subset$abundance_class, size = sample_number, prob = cancer_subset[,paste(x,"_cancer","_expected",sep="")], replace = TRUE)
dominant_negative_class_vector <- append(dominant_negative_class_vector, sum(sample_iteration == "dominant_negative")/sample_number)
low_class_vector <- append(low_class_vector, sum(sample_iteration == "low")/sample_number)
}
dominant_negative_class_frame <- data.frame("percentage_dominant_negative" = dominant_negative_class_vector)
low_class_frame <- data.frame("percentage_low" = low_class_vector)
cancer_observed_dominant_negative_freq <- cancer_observed_table[cancer_observed_table$abundance_class == "dominant_negative",x]
cancer_observed_low_freq <- cancer_observed_table[cancer_observed_table$abundance_class == "low",x]
p38s_vector <- c()
for(y in 1:total_iterations){
sample_iteration <- sample(cancer_subset$variant, size = sample_number, prob = cancer_subset[,paste(x,"_cancer","_expected",sep="")], replace = TRUE)
p38s_vector <- append(p38s_vector, sum(sample_iteration == "P38S")/sample_number)
}
p38s_frame <- data.frame("percentage_p38s" = p38s_vector)
cancer_observed_p38s_freq <- cancer_observed_variant_table[cancer_observed_variant_table$variant == "P38S",x]
dominant_negative_p_value <- sum(cancer_observed_dominant_negative_freq < dominant_negative_class_frame$percentage_dominant_negative) / total_iterations
low_p_value <- sum(cancer_observed_low_freq < low_class_frame$percentage_low) / total_iterations
p38s_p_value <- sum(cancer_observed_p38s_freq < p38s_frame$percentage_p38s) / total_iterations
p_value_vector <- c(dominant_negative_p_value,low_p_value,p38s_p_value)
p_value_vector[p_value_vector == 0] <- 1/total_iterations
cancer_observed_expected_table <- cbind(cancer_observed_expected_table, p_value_vector)
}
colnames(cancer_observed_expected_table) <- c("abundance_class",cancer_list)
print(t(cancer_observed_expected_table))
## [,1] [,2] [,3]
## abundance_class "dominant_negative" "low" "p38s"
## colorectal "0.0001" "0.0032" "0.1978"
## brain "0.0001" "0.0001" "0.4296"
## nsclc "0.0001" "0.0001" "0.0638"
## endometrial "0.0001" "0.0002" "0.2284"
## uterine "0.0001" "0.0032" "0.3684"
## breast "0.0001" "0.0001" "0.2058"
## melanoma "0.0632" "0.0001" "0.0001"
## [1] "Abundance score for C124S: 1.14"
## [1] "Abundance score for R130G: 1.09"
## [1] "Abundance score for G129E: 0.76"
## [1] "The R130Q variant was not found in our library"
text_frequency_cutoff <- 0.05
Hotspot_variants_across_cancers <- ggplot() + geom_boxplot(data = pten_variants, aes(x = "1. Brain", y = cancer_brain_count / sum(cancer_brain_count)), width = 0.1, alpha = 0.5) +
geom_text(data = subset(pten_variants, cancer_brain_count / sum(cancer_brain_count) > text_frequency_cutoff), aes(x = "1. Brain", y = (cancer_brain_count / sum(pten_variants$cancer_brain_count))),
label = subset(pten_variants, cancer_brain_count / sum(cancer_brain_count) > text_frequency_cutoff)[,"variant"], position = position_nudge(x = 0.1), color = "red", angle = -45, hjust = 0, size = 2) +
geom_boxplot(data = pten_variants, aes(x = "2. Colorectal", y = cancer_colorectal_count / sum(cancer_colorectal_count)), width = 0.1, alpha = 0.5) +
geom_text(data = subset(pten_variants, cancer_colorectal_count / sum(cancer_colorectal_count) > text_frequency_cutoff), aes(x = "2. Colorectal", y = (cancer_colorectal_count / sum(pten_variants$cancer_colorectal_count))),
label = subset(pten_variants, cancer_colorectal_count / sum(cancer_colorectal_count) > text_frequency_cutoff)[,"variant"], position = position_nudge(x = 0.15), color = "red", angle = -45, hjust = 0, size = 2) +
geom_boxplot(data = pten_variants, aes(x = "3. Nsclc", y = cancer_nsclc_count / sum(cancer_nsclc_count)), width = 0.1, alpha = 0.5) +
geom_text(data = subset(pten_variants, cancer_nsclc_count / sum(cancer_nsclc_count) > text_frequency_cutoff), aes(x = "3. Nsclc", y = (cancer_nsclc_count / sum(pten_variants$cancer_nsclc_count))),
label = subset(pten_variants, cancer_nsclc_count / sum(cancer_nsclc_count) > text_frequency_cutoff)[,"variant"], position = position_nudge(x = 0.15), color = "red", angle = -45, hjust = 0, size = 2) +
geom_boxplot(data = pten_variants, aes(x = "4. Endometrial", y = cancer_endometrial_count / sum(cancer_endometrial_count)), width = 0.1, alpha = 0.5) +
geom_text(data = subset(pten_variants, cancer_endometrial_count / sum(cancer_endometrial_count) > text_frequency_cutoff), aes(x = "4. Endometrial", y = (cancer_endometrial_count / sum(pten_variants$cancer_endometrial_count))),
label = subset(pten_variants, cancer_endometrial_count / sum(cancer_endometrial_count) > text_frequency_cutoff)[,"variant"], position = position_nudge(x = 0.15), color = "red", angle = -45, hjust = 0, size = 2) +
geom_boxplot(data = pten_variants, aes(x = "5. Uterine", y = cancer_uterine_count / sum(cancer_uterine_count)), width = 0.1, alpha = 0.5) +
geom_text(data = subset(pten_variants, cancer_uterine_count / sum(cancer_uterine_count) > text_frequency_cutoff), aes(x = "5. Uterine", y = (cancer_uterine_count / sum(pten_variants$cancer_uterine_count))),
label = subset(pten_variants, cancer_uterine_count / sum(cancer_uterine_count) > text_frequency_cutoff)[,"variant"], position = position_nudge(x = 0.15), color = "red", angle = -45, hjust = 0, size = 2) +
geom_boxplot(data = pten_variants, aes(x = "6. Breast", y = cancer_breast_count / sum(cancer_breast_count)), width = 0.1, alpha = 0.5) +
geom_boxplot(data = pten_variants, aes(x = "7. Melanoma", y = cancer_melanoma_count / sum(cancer_melanoma_count)), width = 0.1, alpha = 0.5) +
geom_text(data = subset(pten_variants, cancer_melanoma_count / sum(cancer_melanoma_count) > text_frequency_cutoff), aes(x = "7. Melanoma", y = (cancer_melanoma_count / sum(pten_variants$cancer_melanoma_count))),
label = subset(pten_variants, cancer_melanoma_count / sum(cancer_melanoma_count) > text_frequency_cutoff)[,"variant"], position = position_nudge(x = 0.15), color = "red", angle = -45, hjust = 0, size = 2) + xlab(NULL) + ylab("Variant frequency in cancer type") +
theme(axis.text.x = element_text(angle = -45, hjust = 0))
ggsave("Output/Hotspot_variants_across_cancers.pdf",Hotspot_variants_across_cancers,height = 2.5, width = 4)
Hotspot_variants_across_cancers
## [1] "Number of PTEN variants in the Melanoma dataset (n = 77)"
## [1] "P38S amounted to 10.39 % of the observed SNV variants in melanoma"
## [1] "Abundance score for P38S: 1.14"
Here we see how the PTEN VAMP-seq scores compare to the Rosetta computational predictor of folding free energies. Notably, Rosetta suggested that P38S is thermodynamically unstable, which contrasts the WT-like VAMP-seq score, and the WT-like steady-state expression by Western Blot (Fig 4f and Supplementary Fig 5e).
## These are the variants I want to highlight in the plot
notables_list <- c("WT","C124S","C136W","C136R")
pten_ddg_subset <- subset(pten_variants, !is.na(pten_variants$ddg))[,c("variant","ddg","score","start","end","abundance_class")]
pten_ddg_subset$ddg <- as.numeric(pten_ddg_subset$ddg)
pten_ddg_subset$score <- as.numeric(pten_ddg_subset$score)
pten_ddg_subset$position <- gsub("[^0-9]","",pten_ddg_subset$variant)
n <- 1000
x <- mvrnorm(n, mu=c(.5,2.5), Sigma=matrix(c(1,.6,.6,1), ncol=2))
df = data.frame(x); colnames(df) = c("x","y")
commonTheme = list(labs(color="Density",fill="Density",
x="Rosetta delta delta G",
y="Abundance score"),
theme_bw(),
theme(legend.position = "none",
legend.justification=c(0,1)))
PTEN_ddg_density_plot <- ggplot(data=pten_ddg_subset,aes(ddg,score)) +
geom_hline(yintercept = 1, color = "grey75", linetype = 2) + geom_vline(xintercept = 0, color = "grey75", linetype = 2) +
geom_density2d(aes(colour=..level..)) +
scale_colour_gradient(low="green",high="red") +
geom_point(data = subset(pten_variants, class == "wt"), aes(x = ddg, y = score), shape = 4, size = 2) +
geom_point(data = subset(pten_ddg_subset, score < 17), aes(x = ddg, y = score), alpha = 0.2) + commonTheme +
geom_point(data = subset(pten_variants, variant %in% c("WT","P38S","C124S","G129E","R130G","C136R")), aes(x = ddg, y = score), color = "blue", size = 0.4) +
geom_point(data = subset(pten_variants, ddg > 17), aes(x = 17, y = score), shape = 4) +
geom_text(data = subset(pten_variants, variant == "C124S"), aes(x = ddg, y = score + 0.025, label = variant), shape = 4, color = "blue") +
geom_text(data = subset(pten_variants, variant == "P38S"), aes(x = ddg, y = score + 0.025, label = variant), shape = 4, color = "blue") +
geom_text(data = subset(pten_variants, variant == "R130G"), aes(x = ddg, y = score + 0.025, label = variant), shape = 4, color = "blue") +
geom_text(data = subset(pten_variants, variant == "G129E"), aes(x = ddg, y = score + 0.025, label = variant), shape = 4, color = "blue") +
geom_text(data = subset(pten_variants, variant == "C136R"), aes(x = ddg, y = score + 0.025, label = variant), shape = 4, color = "blue") +
geom_text(data = subset(pten_variants, variant == "WT"), aes(x = ddg, y = score + 0.025, label = variant), shape = 4, color = "blue") +
scale_x_continuous(limits = c(-5,17))
ggsave(file = "Output/PTEN_ddg_density_plot.pdf", PTEN_ddg_density_plot, height = 2.5, width = 3)
PTEN_ddg_density_plot
Here, we looked at the distribution of abundance classes given to the TPMT variants and furthermore highlight the abundance scores and classes for common TPMT alleles that have been previous characterized. Additionally, we plot the 120 rare TPMT variants observed in genome and exome studies, and suggest that these may be decreased-function variants and that the risk for thiopurine toxicity may be elevated in patients harboring them.
tpmt_pharmacoframe <- subset(tpmt_variants, variant %in% c("A80P","A154T","Y240C","Q179H","S125L","R215H","R226Q"))
TPMT_possible_snv_plot <- ggplot() +
geom_histogram(data = subset(tpmt_variants, snv == 1), aes(x = score, fill = factor(abundance_class, levels = c("high","wt-like","possibly_wt-like","possibly_low","low"))), binwidth = 0.1, alpha = 0.5, position="stack", color = "black") +
geom_point(data = tpmt_pharmacoframe, aes(x = score, y = 200, color = abundance_class), size = 1, alpha = 0.5) + annotate("text", x = tpmt_pharmacoframe$score, y = 215, label = tpmt_pharmacoframe$variant, angle = 45, size = 2.5, color = "black", hjust = 0) + scale_color_manual(values = custom_colorscale) +
scale_fill_manual(values = custom_colorscale) + theme(legend.position="none", panel.grid.major = element_line(colour="grey90")) +
scale_x_continuous(limits = c(min(tpmt_variants$score, na.rm = TRUE), max(tpmt_variants$score, na.rm = TRUE))) + xlab(NULL) + ylab("Possible SNV") +
scale_y_continuous(limits = c(0,290), breaks = seq(0,200,50), expand = c(0,0.1))
ggsave("Output/TPMT_possible_snv_plot.pdf",TPMT_possible_snv_plot,height = 1.4, width = 2)
TPMT_possible_snv_plot
TPMT_gnomad_plot <- ggplot() + geom_histogram(data = subset(tpmt_variants, !is.na(gnomad_allele_freq) & class == "missense"), aes(x = score, fill = factor(abundance_class, levels = c("high","R130G","wt-like","possibly_wt-like","possibly_low","low"))), binwidth = 0.1, alpha = 0.5, position="stack", color = "black") +
scale_x_continuous(limits = c(min(tpmt_variants$score, na.rm = TRUE), max(tpmt_variants$score, na.rm = TRUE))) + xlab(NULL) + ylab("Pathog") +
scale_y_continuous(expand = c(0,0.1)) + xlab(NULL) + ylab("Pathog") +
scale_fill_manual(values = custom_colorscale) + theme(legend.position="none", panel.grid.major = element_line(colour="grey90"))
ggsave("Output/TPMT_gnomad_plot.pdf",TPMT_gnomad_plot,height = 1, width = 1.97)
TPMT_gnomad_plot
Here, we compared TPMT abundance, either as VAMP-seq abundance scores or normalized EGFP:mCherry ratios, with TPMT alleles tested tested in a Red Blood Cell activity assay or dosing intensities used to treat patients. Notably, we observed that Thiopurine dose intensity was highly correlated with variant abundance.
TPMT_rbc_activity_plot <- ggplot() + geom_abline(slope = lm(tpmt_variants$rbc_assay ~ tpmt_variants$score)$coefficient[2], intercept = lm(tpmt_variants$rbc_assay ~ tpmt_variants$score)$coefficient[1], linetype = 2, color = "grey50") + geom_point(data = tpmt_variants, aes(x = score, y = rbc_assay)) + xlab("Abundance score") + ylab("RBC assay") +
geom_text(data = tpmt_variants, aes(x = score, y = rbc_assay + 0.015), label = tpmt_variants$variant, color = "red") +
scale_x_continuous(limits = c(min(subset(tpmt_variants, !is.na(rbc_assay))$score, na.rm = TRUE), max(subset(tpmt_variants, !is.na(rbc_assay))$score, na.rm = TRUE)), expand = c(0,0.2)) +
annotate("text", x = min(subset(tpmt_variants, !is.na(rbc_assay))$score, na.rm = TRUE), y = max(subset(tpmt_variants, !is.na(rbc_assay))$rbc_assay, na.rm = TRUE),
label = paste("r:",round(sqrt(summary(lm(tpmt_variants$rbc_assay ~ tpmt_variants$score))$r.squared),2)), hjust = 0.1) +
annotate("text", x = min(subset(tpmt_variants, !is.na(rbc_assay))$score, na.rm = TRUE), y = max(subset(tpmt_variants, !is.na(rbc_assay))$rbc_assay, na.rm = TRUE) - 1, label = paste("rho:",round(cor(tpmt_variants$rbc_assay, tpmt_variants$score, use="pairwise.complete.obs", method = "spearman"),2)), hjust = 0.1, parse = TRUE)
ggsave("Output/TPMT_rbc_activity_plot.pdf",TPMT_rbc_activity_plot,height = 3, width = 3)
TPMT_rbc_activity_plot
TPMT_rbc_activity_individual_plot <- ggplot() + geom_abline(slope = lm(tpmt_variants$rbc_assay ~ tpmt_variants$single_WTNorm)$coefficient[2], intercept = lm(tpmt_variants$rbc_assay ~ tpmt_variants$single_WTNorm)$coefficient[1], linetype = 2, color = "grey50") + geom_point(data = tpmt_variants, aes(x = single_WTNorm, y = rbc_assay)) + xlab("WT-normalized EGFP:mCherry\ngeometric mean") + ylab("RBC assay") +
geom_text(data = tpmt_variants, aes(x = single_WTNorm, y = rbc_assay - 0.015), label = tpmt_variants$variant, color = "red") +
scale_x_continuous(limits = c(min(subset(tpmt_variants, !is.na(rbc_assay))$single_WTNorm, na.rm = TRUE), max(subset(tpmt_variants, !is.na(rbc_assay))$single_WTNorm, na.rm = TRUE)), expand = c(0,0.2)) +
annotate("text", x = min(subset(tpmt_variants, !is.na(rbc_assay))$single_WTNorm, na.rm = TRUE), y = max(subset(tpmt_variants, !is.na(rbc_assay))$rbc_assay, na.rm = TRUE),
label = paste("r:",round(sqrt(summary(lm(tpmt_variants$rbc_assay ~ tpmt_variants$single_WTNorm))$r.squared),2)), hjust = 0.1) +
annotate("text", x = min(subset(tpmt_variants, !is.na(rbc_assay))$single_WTNorm, na.rm = TRUE), y = max(subset(tpmt_variants, !is.na(rbc_assay))$rbc_assay, na.rm = TRUE) - 1, label = paste("rho:",round(cor(tpmt_variants$rbc_assay, tpmt_variants$single_WTNorm, use="pairwise.complete.obs", method = "spearman"),2)), hjust = 0.1, parse = TRUE)
ggsave("Output/TPMT_rbc_activity_individual_plot.pdf",TPMT_rbc_activity_individual_plot,height = 3, width = 3)
TPMT_rbc_activity_individual_plot
TPMT_dosage_intensity_plot <- ggplot() + geom_abline(slope = lm(tpmt_variants$average_dose_intensity ~ tpmt_variants$score)$coefficient[2], intercept = lm(tpmt_variants$average_dose_intensity ~ tpmt_variants$score)$coefficient[1], linetype = 2, color = "grey50") +
geom_point(data = tpmt_variants, aes(x = score, y = average_dose_intensity)) + xlab("Abundance score") + ylab("Dosing intensity") +
geom_text(data = tpmt_variants, aes(x = score, y = average_dose_intensity + 0.015), label = tpmt_variants$variant, color = "red") +
scale_x_continuous(limits = c(min(subset(tpmt_variants, !is.na(average_dose_intensity))$score, na.rm = TRUE), max(subset(tpmt_variants, !is.na(average_dose_intensity))$score, na.rm = TRUE)), expand = c(0,0.2)) +
annotate("text", x = min(subset(tpmt_variants, !is.na(average_dose_intensity))$score, na.rm = TRUE), y = max(subset(tpmt_variants, !is.na(average_dose_intensity))$average_dose_intensity, na.rm = TRUE),
label = paste("r:",round(sqrt(summary(lm(tpmt_variants$average_dose_intensity ~ tpmt_variants$score))$r.squared),2)), hjust = 0.1) +
annotate("text", x = min(subset(tpmt_variants, !is.na(average_dose_intensity))$score, na.rm = TRUE), y = max(subset(tpmt_variants, !is.na(average_dose_intensity))$average_dose_intensity, na.rm = TRUE) - 0.03, label = paste("rho:",round(cor(tpmt_variants$average_dose_intensity, tpmt_variants$score, use="pairwise.complete.obs", method = "spearman"),2)), hjust = 0.1, parse = TRUE)
ggsave("Output/TPMT_dosage_intensity_plot.pdf",TPMT_dosage_intensity_plot,height = 3, width = 3)
TPMT_dosage_intensity_plot
TPMT_dosage_intensity_individual_plot <- ggplot() + geom_abline(slope = lm(tpmt_variants$average_dose_intensity ~ tpmt_variants$single_WTNorm)$coefficient[2], intercept = lm(tpmt_variants$average_dose_intensity ~ tpmt_variants$single_WTNorm)$coefficient[1], linetype = 2, color = "grey50") +
geom_point(data = tpmt_variants, aes(x = single_WTNorm, y = average_dose_intensity)) + xlab("WT-normalized EGFP:mCherry\ngeometric mean") + ylab("Dosing intensity") +
geom_text(data = tpmt_variants, aes(x = single_WTNorm, y = average_dose_intensity + 0.012), label = tpmt_variants$variant, color = "red") +
annotate("text", x = min(subset(tpmt_variants, !is.na(average_dose_intensity))$single_WTNorm, na.rm = TRUE), y = max(subset(tpmt_variants, !is.na(average_dose_intensity))$average_dose_intensity, na.rm = TRUE), label = paste("r:",round(sqrt(summary(lm(tpmt_variants$average_dose_intensity ~ tpmt_variants$single_WTNorm))$r.squared),2)), hjust = 0.1) +
annotate("text", x = min(subset(tpmt_variants, !is.na(average_dose_intensity))$single_WTNorm, na.rm = TRUE), y = max(subset(tpmt_variants, !is.na(average_dose_intensity))$average_dose_intensity, na.rm = TRUE) - 0.03, label = paste("rho:",round(cor(tpmt_variants$average_dose_intensity, tpmt_variants$single_WTNorm, use="pairwise.complete.obs", method = "spearman"),2)), hjust = 0.1, parse = TRUE) +
scale_x_continuous(limits = c(min(subset(tpmt_variants, !is.na(average_dose_intensity))$single_WTNorm, na.rm = TRUE), max(subset(tpmt_variants, !is.na(average_dose_intensity))$single_WTNorm, na.rm = TRUE)), expand = c(0,0.2))
ggsave("Output/TPMT_dosage_intensity_individual_plot.pdf",TPMT_dosage_intensity_individual_plot,height = 3, width = 3)
TPMT_dosage_intensity_individual_plot
Finally, some basic stats of TPMT in our dataset.
## [1] "TPMT rare missense variants with MAF > 4e-6 (regardless of abundance score): 118"
## [1] "TPMT missense variants possible through SNV, with possibly low abundance scores: 777"
## [1] "TPMT rare missense variants with MAF > 4e-6: 96"
## [1] "TPMT rare missense variants with MAF > 4e-6, with low abundance scores: 14"
## [1] "Ratio of low abundance TPMT variants of those observed with MAF > 4e-6: 0.15"
## [1] "TPMT rare missense variants with MAF > 4e-6, with possibly low abundance scores: 17"
## [1] "Ratio of possibly low abundance TPMT variants of those observed with MAF > 4e-6: 0.18"
## [1] "TPMT missense variants possible through SNV, with low or possibly low abundance scores: 389"